Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo http://researchmap.jp/sada/resources/

2 Standard Data Structures for Strings
Operations
–Number of occurrences and locations of a pattern
–Common substrings, maximal patterns
–Alignment of two strings
Standard data structures
–Suffix trees [1,2]
–Suffix arrays [3]
Size: string size + O(n log n) bits
–DNA sequence of a human: 3 billion letters (750MB)
–Its suffix tree: 40GB

3 Suffixes of a String
Suffixes are the strings made by omitting letters from the beginning of a string T.
There are n suffixes of a string of length n.
Any substring of T is a prefix of a suffix of T.
T = ababac$
i  suffix
1  ababac$
2  babac$
3  abac$
4  bac$
5  ac$
6  c$
7  $

4 Suffix Arrays [3]
An array storing pointers to the suffixes, which are sorted lexicographically.
Size: n log n + n log |A| bits
–A: the alphabet
–|A|: alphabet size
Time for searching a pattern P: O(|P| log n)
i  SA  suffix
1  7   $
2  1   ababac$
3  3   abac$
4  5   ac$
5  2   babac$
6  4   bac$
7  6   c$
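
The O(|P| log n) search on the slide above can be illustrated with a small sketch. It builds the suffix array by a naive sort (not a linear-time construction) and locates a pattern with two binary searches. The function names `build_suffix_array` and `find_occurrences` are illustrative and do not come from the slides; positions are 1-based to match the table.

```python
def build_suffix_array(text):
    """Return the 1-based suffix array of text (naive sort, for illustration)."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(row):
        # First m letters of the suffix stored in row (0-based row index).
        start = sa[row] - 1
        return text[start:start + m]

    # Lower bound: first row whose m-letter prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first row whose m-letter prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


T = "ababac$"
SA = build_suffix_array(T)
print(SA)                              # [7, 1, 3, 5, 2, 4, 6]
print(find_occurrences(T, SA, "aba"))  # [1, 3]
```

Each binary-search step compares at most |P| letters, which is where the O(|P| log n) bound comes from. The sketch relies on '$' sorting before the letters, which holds for ASCII.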

5 Compressed Suffix Arrays (CSA) [4,5]
Instead of storing SA, store Ψ[i] = SA^-1[SA[i]+1]
Size: O(n log |A|) bits
Time for searching a pattern P: O(|P| log n)
i  Ψ  SA  suffix
1  0  7   $
2  5  1   ababac$
3  6  3   abac$
4  7  5   ac$
5  3  2   babac$
6  4  4   bac$
7  1  6   c$

6 How to Compute Ψ
1. Construct the suffix array SA
2. Radix-sort i w.r.t. (T[SA[i]-1], i)
   1. Count the number of occurrences of each character in T
   2. For i = 1,2,...,n, let c = T[SA[i]-1]
   3. Write i into the next free entry of the range of Ψ corresponding to c
Time complexity: O(n)
i  Ψ  SA  suffix
1  0  7   $
2  5  1   ababac$
3  6  3   abac$
4  7  5   ac$
5  3  2   babac$
6  4  4   bac$
7  1  6   c$
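
A minimal sketch of the counting pass described above, assuming the suffix array is already available and that the text ends with a unique smallest terminator. The name `compute_psi` is illustrative; the row of the last suffix is left at 0, as in the table.

```python
def compute_psi(text, sa):
    """Compute psi[i] = SA^-1[SA[i] + 1] for 1-based rows i.

    psi[0] is unused padding. The row holding the last suffix keeps the
    value 0, matching the table on the slide.
    """
    n = len(text)
    # Step 1: count letters and find the first row of each letter's range.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    next_slot = {}
    row = 1
    for ch in sorted(counts):
        next_slot[ch] = row
        row += counts[ch]
    # Steps 2-3: scan rows in increasing i. Row i is the psi value of the
    # suffix starting one position earlier, which lies in the range of c.
    psi = [0] * (n + 1)
    for i in range(1, n + 1):
        if sa[i - 1] == 1:
            continue                  # the whole text has no predecessor
        c = text[sa[i - 1] - 2]       # T[SA[i] - 1] in 1-based notation
        psi[next_slot[c]] = i
        next_slot[c] += 1
    return psi


T = "ababac$"
SA = [7, 1, 3, 5, 2, 4, 6]
print(compute_psi(T, SA)[1:])         # [0, 5, 6, 7, 3, 4, 1]
```

Because rows are visited in increasing i, the values written into each letter's range are automatically increasing, so no comparison sort is needed.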

7 Why can Ψ be compressed?
Suffixes are stored in lexicographic order.
Among suffixes sharing the same first letter, the lexicographic order does not change when the first letter is removed.
T = いるかいないかいないかいるかいるいるいるか
i   SA  Ψ   suffix
1   6   12  いかいないかいるかいるいるいるか
2   10  14  いかいるかいるいるいるか
3   4   15  いないかいないかいるかいるいるいるか
4   8   16  いないかいるかいるいるいるか
5   15  17  いるいるいるか
6   17  18  いるいるか
7   19  19  いるか
8   1   20  いるかいないかいないかいるかいるいるいるか
9   12  21  いるかいるいるいるか
10  21  0   か
11  3   3   かいないかいないかいるかいるいるいるか
12  7   4   かいないかいるかいるいるいるか
13  14  5   かいるいるいるか
14  11  9   かいるかいるいるいるか
15  5   1   ないかいないかいるかいるいるいるか
16  9   2   ないかいるかいるいるいるか
17  16  6   るいるいるか
18  18  7   るいるか
19  20  10  るか
20  2   11  るかいないかいないかいるかいるいるいるか
21  13  13  るかいるいるいるか

8 Properties of CSA
If i < j, then T[SA[i]] ≤ T[SA[j]]
If i < j and T[SA[i]] = T[SA[j]], then Ψ[i] < Ψ[j]
Proof: If T[SA[i]] = T[SA[j]], the lex. order of the two suffixes is determined by the letters at position 2 or later.
Since i < j, T[SA[i]+1..n] < T[SA[j]+1..n]
Let SA[i'] = SA[i]+1 and SA[j'] = SA[j]+1; then i' < j'
That is, i' = SA^-1[SA[i]+1] = Ψ[i] < Ψ[j] = j'
i  Ψ  SA  suffix
1  0  7   $
2  5  1   ababac$
3  6  3   abac$
4  7  5   ac$
5  3  2   babac$
6  4  4   bac$
7  1  6   c$

9 Succinct Data Structure for Bit Vectors
B: 0,1 vector of length n, B[0]B[1]…B[n−1]
Lower bound of size = log 2^n = n bits
Queries
–rank(B, x): number of ones in B[0..x] = B[0]B[1]…B[x]
–select(B, i): position of the i-th 1 from the head (i ≥ 1)
Theorem: rank and select on a bit vector of length n can be computed in constant time on a word RAM with word length Ω(log n) bits, using n + O(n log log n / log n) bits.
Example: B = 1001010001000000, n = 16; the ones are at positions 0, 3, 5, 9
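
To make the two queries concrete, here is a small sketch with the same 0-based conventions. It is a teaching aid only: it stores one counter per block and answers select by binary search on rank, so it does not achieve the constant time or the n + o(n) bits of the theorem. The class name `BitVector` and the block size are assumptions made for the example.

```python
class BitVector:
    """Plain bit vector with rank/select; not a succinct structure."""

    BLOCK = 4  # illustrative block size

    def __init__(self, bits):
        self.bits = [int(b) for b in bits]
        # ones_before[j] = number of ones in bits[0 : j * BLOCK]
        self.ones_before = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            block_ones = sum(self.bits[start:start + self.BLOCK])
            self.ones_before.append(self.ones_before[-1] + block_ones)

    def rank(self, x):
        """Number of ones in bits[0..x], inclusive."""
        block = (x + 1) // self.BLOCK
        tail = self.bits[block * self.BLOCK:x + 1]
        return self.ones_before[block] + sum(tail)

    def select(self, i):
        """Position of the i-th one; assumes 1 <= i <= number of ones."""
        lo, hi = 0, len(self.bits) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(mid) >= i:
                hi = mid
            else:
                lo = mid + 1
        return lo


B = BitVector("1001010001000000")
print(B.rank(5), B.rank(8))                   # 3 3
print([B.select(i) for i in (1, 2, 3, 4)])    # [0, 3, 5, 9]
```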

10 How to Encode Ψ
Ψ'[i] = T[SA[i]]·n + Ψ[i] is used
–Ψ[i] = Ψ'[i] mod n
–T[SA[i]] = Ψ'[i] div n
Ψ'[i] (i = 1,2,...,n) forms an increasing sequence
–n(2+log σ) bits
Example: Ψ values grouped by first letter, and Ψ' in binary (2 bits for the letter, 4 bits for Ψ)
letter  Ψ        Ψ' (binary)
$       2        000010
a       5, 8, 9  010101, 011000, 011001
c       3, 4     100011, 100100
g       1, 6, 7  110001, 110110, 110111

11 How to Encode Ψ'
The most significant log n bits of the binary encoding of Ψ'[i]
–Encode the difference from the preceding value in unary code
–At most 2n bits (#ones = n, #zeros ≤ n)
The lowest log σ bits of Ψ'[i] are stored as they are
–n log σ bits
letter  Ψ        Ψ' (binary)
$       2        000010
a       5, 8, 9  010101, 011000, 011001
c       3, 4     100011, 100100
g       1, 6, 7  110001, 110110, 110111
Upper bits (unary-coded differences): 1, 000001, 01, 1, 001, 01, 0001, 01, 1
Lower bits: 10, 01, 00, 01, 11, 00, 01, 10, 11

12 Decoding of Ψ'
Upper digits: x = select(H, i) − i (positions in H counted from 1)
Lower digits: y = L[i]
Ψ'[i] = x·σ + y
Time: O(1)
Space: n(2+log σ) + O(n log log n / log n) bits
letter  Ψ
$       2
a       5, 8, 9
c       3, 4
g       1, 6, 7
H: 1, 000001, 01, 1, 001, 01, 0001, 01, 1
L: 10, 01, 00, 01, 11, 00, 01, 10, 11
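
The encoding and decoding steps can be checked on the example values with a short sketch. It follows the split used in the example (4 upper bits coded in unary as differences, 2 lower bits stored verbatim) and counts positions in H from 1. The function names and the linear-scan `select1` are illustrative stand-ins; a real structure would use a constant-time select.

```python
def encode(values, low_bits):
    """Split a non-decreasing sequence into unary-coded highs and raw lows."""
    H, L, prev_high = [], [], 0
    for v in values:
        high = v >> low_bits
        H.extend([0] * (high - prev_high) + [1])   # unary code of the gap
        L.append(v & ((1 << low_bits) - 1))        # lowest bits, verbatim
        prev_high = high
    return H, L


def select1(H, i):
    """1-based position of the i-th one in H (linear scan for clarity)."""
    seen = 0
    for pos, bit in enumerate(H, start=1):
        seen += bit
        if seen == i:
            return pos
    raise IndexError("H contains fewer than i ones")


def decode(H, L, low_bits, i):
    """Recover the i-th value (1-based): x = select(H, i) - i, y = L[i]."""
    x = select1(H, i) - i
    return (x << low_bits) + L[i - 1]


# The nine 6-bit values from the example: 000010, 010101, 011000, ...
values = [0b000010, 0b010101, 0b011000, 0b011001, 0b100011,
          0b100100, 0b110001, 0b110110, 0b110111]
H, L = encode(values, low_bits=2)
print("".join(map(str, H)))
print([decode(H, L, 2, i) for i in range(1, len(values) + 1)] == values)
```

After i values, H contains i ones and as many zeros as the i-th upper part, so subtracting i from the position of the i-th one yields the upper digits.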

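13 Compressing Ψ
Divide the Ψ[i]'s according to T[SA[i]]; let S(c) be the set of Ψ values for letter c and n_c = |S(c)|
Encode each S(c): n_c (2 + log(n/n_c)) bits
In total: Σ_c n_c (2 + log(n/n_c)) = n(2 + H_0) bits
H_0 ≤ log σ (equality holds if p_1 = p_2 = …)
(p_c = n_c/n: probability of letter c)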

14 How to Access SA[i]
For every i that is a multiple of w = log n, store SA[i] in SA2
  k = 0; w = log n;
  while (i % w != 0) {
    i = Ψ[i]; k++;
  }
  return SA2[i / w] - k;
i  Ψ  SA  suffix
0     8
1  0  7   $
2  5  1   ababac$
3  6  3   abac$
4  7  5   ac$
5  3  2   babac$
6  4  4   bac$
7  1  6   c$
SA2[0] = 8, SA2[1] = 3, SA2[2] = 4 (n = 8, w = 3)
Access time: O(log n) on average

15 How to Access SA[i] (worst-case bound)
i     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T     E  B  D  E  B  D  D  A  D  D  E  B  E  B  D  C
SA    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
Ψ    10  8  9 11 13 15  1  6  7 12 14 16  2  3  4  5
B     1  0  0  0  1  1  0  0  0  0  0  0  0  1  0  0
SA2   2  3  4  1
B[i] = 1 ⇔ SA[i] is a multiple of log n
Store SA[i] in SA2 if it is a multiple of log n
B: n + o(n) bits
  k = 0; w = log n;
  while (B[i] != 1) {
    i = Ψ[i]; k++;
  }
  return SA2[rank(B, i)] · w − k;
Access time: O(log n) (worst case)
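
The lookup loop above can be traced on the 16-letter example. The sketch below hard-codes Ψ from the slide, derives B and SA2 from SA with w = 4, and uses a plain prefix sum in place of a constant-time rank. The full SA array is kept only to build the samples and to check the results; the function name `lookup` is illustrative.

```python
# Values copied from the example (1-based rows stored in 0-based lists).
SA = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
PSI = [10, 8, 9, 11, 13, 15, 1, 6, 7, 12, 14, 16, 2, 3, 4, 5]
W = 4  # log n for n = 16

# B marks the rows whose SA value is a multiple of W; SA2 stores SA / W there.
B = [1 if v % W == 0 else 0 for v in SA]
SA2 = [v // W for v in SA if v % W == 0]


def lookup(i):
    """Return SA[i] (1-based i) using only PSI, B and SA2."""
    k = 0
    while B[i - 1] != 1:
        i = PSI[i - 1]      # move to the row of the next text position
        k += 1
    r = sum(B[:i])          # rank(B, i): ones in B[1..i]
    return SA2[r - 1] * W - k


print(B)                                   # matches row B of the slide
print(SA2)                                 # [2, 3, 4, 1]
print([lookup(i) for i in range(1, 17)])   # reproduces SA
```

Every step of Ψ advances one position in the text, so a sampled position is reached after fewer than w steps; this is the source of the worst-case bound.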

16 Hierarchical Representation of Ψ
At level i
–Consecutive 2^i letters of T are regarded as one letter
–The entropy of the string does not increase
i     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T     E  B  D  E  B  D  D  A  D  D  E  B  E  B  D  C
SA    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B     1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
SA1   4  7  1  6  8  3  5  2
SA2   2  3  4  1
(Figure labels: BDEBDDADDEBEBDC$ and DEBEBDEBDDADBDC$)

17 Size of the Data Structure
If the number of levels is 1/ε
–Ψ: (1/ε) · n(3+H_0) bits
–SA_{1/ε}: (n/log n) · log n = n bits
–B: n + n/2 + n/4 + ... ≤ 2n bits
Total: (formula missing in the transcript) bits
Time to compute SA[i]: (formula missing in the transcript) time

18 Searching Substrings
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T         E  B  D  E  B  D  D  A  D  D  E  B  E  B  D  C
SA        8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
T[SA[i]]  A  B  B  B  B  C  D  D  D  D  D  D  E  E  E  E
Ψ        10  8  9 11 13 15  1  6  7 12 14 16  2  3  4  5
D         1  1  0  0  0  1  1  0  0  0  0  0  1  0  0  0
r         1  2  2  2  2  3  4  4  4  4  4  4  5  5  5  5
C: 1 → A, 2 → B, 3 → C, 4 → D, 5 → E
D[i] = 1 marks the first row of each letter; r[i] = rank(D, i), and C[r[i]] is the first letter of the suffix in row i
Binary search can be done without using the actual values of SA

19 Backward Search
To search for P = P[1..p]:
  [l, r] = [1, n]
  for (k = p; k >= 1; k--)
    [l, r] = the range of rows i in C[P[k]] with Ψ[i] in [l, r]
C[$]=[1,1] C[a]=[2,4] C[b]=[5,6] C[c]=[7,7]
O(p log n) time
i  Ψ  SA  suffix
1  0  7   $
2  5  1   ababac$
3  6  3   abac$
4  7  5   ac$
5  3  2   babac$
6  4  4   bac$
7  1  6   c$

20 Backward Search: Time Complexity
Binary search w.r.t. Ψ: O(log n) time
Search time for P: O(|P| log n) time
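
A minimal sketch of backward search by binary search on Ψ, using the table for T = ababac$. It assumes that each step keeps the rows i inside the range of the current letter whose Ψ[i] falls in the range found so far, which is possible because Ψ is increasing inside each letter's range. The name `backward_search` and the handling of the last pattern letter (its range is taken directly from C) are choices made for this example.

```python
from bisect import bisect_left, bisect_right

# PSI[0] is padding so that rows are 1-based, as on the slides.
PSI = [0, 0, 5, 6, 7, 3, 4, 1]
C = {"$": (1, 1), "a": (2, 4), "b": (5, 6), "c": (7, 7)}


def backward_search(pattern):
    """Return the 1-based row range (l, r) of a non-empty pattern, or None."""
    if any(ch not in C for ch in pattern):
        return None
    l, r = C[pattern[-1]]                 # rows starting with the last letter
    for ch in reversed(pattern[:-1]):
        lo, hi = C[ch]
        # Within [lo, hi], keep the rows whose PSI value lies in [l, r].
        new_l = bisect_left(PSI, l, lo, hi + 1)
        new_r = bisect_right(PSI, r, lo, hi + 1) - 1
        if new_l > new_r:
            return None
        l, r = new_l, new_r
    return l, r


print(backward_search("ba"))    # (5, 6): rows with SA = 2 and 4
print(backward_search("aba"))   # (2, 3): rows with SA = 1 and 3
print(backward_search("bb"))    # None
```

Each pattern letter costs two binary searches over at most n entries, giving O(|P| log n) in total.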

21 Partial Decoding of String
i         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T         E  B  D  E  B  D  D  A  D  D  E  B  E  B  D  C
SA        8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
T[SA[i]]  A  B  B  B  B  C  D  D  D  D  D  D  E  E  E  E
Ψ        10  8  9 11 13 15  1  6  7 12 14 16  2  3  4  5
C: 1 → A, 2 → B, 3 → C, 4 → D, 5 → E
To decode T[9..13] = DDEBE
1. Compute i = SA^-1[9] = 10
2. Find the first letter of the suffix with lex. order i
3. Traverse Ψ from i = 10
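
The three decoding steps can be sketched as follows. The bit vector D (a 1 at the first row of each letter) and the letter table are taken from the same 16-letter example; a prefix sum stands in for a constant-time rank, and the function name `decode_from_row` is illustrative.

```python
# PSI[0] is padding so that rows are 1-based.
PSI = [0, 10, 8, 9, 11, 13, 15, 1, 6, 7, 12, 14, 16, 2, 3, 4, 5]
D = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]   # rows 1..16
LETTERS = "ABCDE"


def decode_from_row(i, length):
    """Return `length` letters of T starting at position SA[i]."""
    out = []
    for _ in range(length):
        r = sum(D[:i])              # rank(D, i): which letter range row i is in
        out.append(LETTERS[r - 1])  # first letter of the suffix in row i
        i = PSI[i]                  # row of the suffix one position later
    return "".join(out)


print(decode_from_row(10, 5))    # DDEBE, i.e. T[9..13] since SA[10] = 9
print(decode_from_row(15, 16))   # EBDEBDDADDEBEBDC, since SA[15] = 1
```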

22 Functions of Compressed Suffix Arrays
lookup(i): returns SA[i] (O(log n) time)
inverse(i): returns SA^-1[i] (O(log n) time)
Ψ[i]: returns SA^-1[SA[i]+1] (O(1) time)
substring(i,l): returns T[SA[i]..SA[i]+l-1]
–O(l) time
–(T[SA[i]] is computed by rank on a 0,1 vector of length n)

23 Problems of CSA
Size is n(H_0(S)+O(1)) bits
Want to compress into nH_k(S)+o(n) bits

24 References
[1] P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pages 1–11, 1973.
[2] E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262–272, 1976.
[3] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. In Proc. SODA, 1990.
[4] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[5] K. Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294–313, 2003.
[6] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.

25 References (continued)
[7] J. G. Cleary and I. H. Witten. A comparison of enumerative and adaptive codes. IEEE Transactions on Information Theory, 30(2):306–315, 1984.
[8] K. Sadakane. A Modified Burrows-Wheeler Transformation for Case-Insensitive Search with Application to Suffix Array Compression. In Data Compression Conference, page 548, 1999.
[9] P. Ferragina and G. Manzini. Indexing compressed texts. Journal of the ACM, 52(4):552–581, 2005.

26 Block-Sort Compression Algorithm
Burrows, Wheeler 1994 [6]
Compression ratio is better than gzip, close to PPM [7]
Compression is faster than PPM
Decompression is much faster
–Suitable for distributing data

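27 Block-Sort Compression Algorithm
Pipeline: suffix sorting → BW transform → MTF transform → Huffman code / arithmetic code / γ code
Input:                ababac$
After BW transform:   c$bbaaa
After MTF transform:  3 4 4 1 4 1 1
After γ code:         011 00100 00100 1 00100 1 1
γ code table
1  1
2  010
3  011
4  00100
5  00101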
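
The last two stages of the pipeline can be reproduced with a short sketch. The initial move-to-front list "abc$" is an assumption inferred from the output 3 4 4 1 4 1 1 shown above; the slide does not state it. The function names are illustrative.

```python
def move_to_front(text, alphabet):
    """Replace each letter by its 1-based rank in a self-adjusting list."""
    table = list(alphabet)
    out = []
    for ch in text:
        pos = table.index(ch)
        out.append(pos + 1)
        table.insert(0, table.pop(pos))   # move the letter to the front
    return out


def gamma_code(x):
    """Elias gamma code of x >= 1: (len - 1) zeros, then x in binary."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


ranks = move_to_front("c$bbaaa", "abc$")
print(ranks)                                    # [3, 4, 4, 1, 4, 1, 1]
print(" ".join(gamma_code(x) for x in ranks))   # 011 00100 00100 1 00100 1 1
```

Runs of equal letters in the BW output turn into runs of 1s after MTF, which is why a code that spends a single bit on the value 1 works well here.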

28 Suffix Array
T = acagcagg$
Suffixes of T        After lexicographic sort
i  suffix            SA  suffix
1  acagcagg$         9   $
2  cagcagg$          1   acagcagg$
3  agcagg$           3   agcagg$
4  gcagg$            6   agg$
5  cagg$             2   cagcagg$
6  agg$              5   cagg$
7  gg$               8   g$
8  g$                4   gcagg$
9  $                 7   gg$
Suffix array: SA = 9 1 3 6 2 5 8 4 7

29 BW Transform
BW[i] = T[SA[i]−1] (for SA[i] = 1, the last letter of T is used)
BW lists the characters of T sorted by the lex. order of the suffixes that follow them
BW is a permutation of T
T can be recovered from BW
BW is compressed by a simple (order-0) compression algorithm
T = acagcagg$, BW = g$ccaggaa
SA  BW  following suffix
9   g   $
1   $   acagcagg$
3   c   agcagg$
6   c   agg$
2   a   cagcagg$
5   g   cagg$
8   g   g$
4   a   gcagg$
7   a   gg$
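
As a quick check of the definition, the transform can be computed from a naively sorted suffix array. This is a sketch for small inputs, assuming the text ends with a unique terminator that sorts before all other letters; the function name `bwt` is illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via a naively sorted suffix array."""
    # 0-based suffix start positions in lexicographic order of the suffixes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] is the letter before suffix i; for i = 0 Python's
    # negative index wraps around to the last letter (the terminator).
    return "".join(text[i - 1] for i in sa)


print(bwt("acagcagg$"))   # g$ccaggaa
print(bwt("ababac$"))     # c$bbaaa
```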

30 Inverse BW Transform
T can be recovered from BW
SA can also be recovered [8]
BW  sorted suffix  SA
g   $              9
$   acagcagg$      1
c   agcagg$        3
c   agg$           6
a   cagcagg$       2
g   cagg$          5
g   g$             8
a   gcagg$         4
a   gg$            7
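
One common way to invert the transform is to follow the LF mapping backwards from the row of the terminator. The sketch below does this with 0-based rows and assumes a unique terminator that is the smallest letter; it recovers T only and is not the SA-recovery method of [8]. The function name `inverse_bwt` is illustrative.

```python
def inverse_bwt(bw, end="$"):
    """Recover the text from its BW transform by walking the LF mapping."""
    # C[c] = number of letters in bw that are smaller than c.
    counts = {}
    for ch in bw:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # lf[i] = C[bw[i]] + (occurrences of bw[i] before row i), 0-based rows.
    seen = {}
    lf = []
    for ch in bw:
        lf.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Row 0 holds the suffix "$"; bw[0] is the letter just before it.
    out = [end]
    i = 0
    for _ in range(len(bw) - 1):
        out.append(bw[i])
        i = lf[i]
    return "".join(reversed(out))


print(inverse_bwt("g$ccaggaa"))   # acagcagg$
```

The text is produced from its last letter to its first, so the collected letters are reversed at the end.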

31 FM-index [9]
Pattern search can be done using only BW
To search for P = P[1..p]:
  c = P[p]
  l = C[c]+1, r = C[c+1]   (c+1: the letter following c; use n for the largest letter)
  while (--p ≥ 1) {
    c = P[p]
    l = C[c] + rank_c(BW, l−1) + 1
    r = C[c] + rank_c(BW, r)
  }
[l,r] is the range of lex. orders of the suffixes that start with P
C: $: 0, a: 1, c: 4, g: 6
i                1  2  3  4  5  6  7  8  9
BW               g  $  c  c  a  g  g  a  a
sorted letters   $  a  a  a  c  c  g  g  g

32 Substring Search
Given the SA range [l,r] for pattern P, the range [l',r'] for cP is computed by
  l' = C[c] + rank_c(BW, l−1) + 1
  r' = C[c] + rank_c(BW, r)
C: $: 0, a: 1, c: 4, g: 6
i  BW  sorted suffix  SA
1  g   $              9
2  $   acagcagg$      1
3  c   agcagg$        3
4  c   agg$           6
5  a   cagcagg$       2
6  g   cagg$          5
7  g   g$             8
8  a   gcagg$         4
9  a   gg$            7
Example: g: [7,9], ag: [3,4], cag: [5,6]
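
The range update above can be run on the example with a few lines. In this sketch, rank is a plain count over a prefix of BW rather than a wavelet-tree query, and the search starts from the full range [1, n], which yields the same first range as initialising from C directly. The names `build_C`, `rank` and `fm_search` are illustrative.

```python
def build_C(bw):
    """C[c] = number of letters in bw that are smaller than c."""
    C, total = {}, 0
    for ch in sorted(set(bw)):
        C[ch] = total
        total += bw.count(ch)
    return C


def rank(bw, c, i):
    """Number of occurrences of c in BW[1..i] (1-based, inclusive)."""
    return bw[:i].count(c)


def fm_search(bw, pattern):
    """Return the 1-based SA range (l, r) of pattern, or None if absent."""
    C = build_C(bw)
    l, r = 1, len(bw)
    for c in reversed(pattern):
        if c not in C:
            return None
        l = C[c] + rank(bw, c, l - 1) + 1
        r = C[c] + rank(bw, c, r)
        if l > r:
            return None
    return l, r


BW = "g$ccaggaa"
print(fm_search(BW, "g"))     # (7, 9)
print(fm_search(BW, "ag"))    # (3, 4)
print(fm_search(BW, "cag"))   # (5, 6)
```

Only BW and C are consulted; the suffix array itself is never touched during the search.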

33 LF mapping
LF[i] is the lex. order of the suffix T[j−1..n], where j = SA[i]; that is, SA[LF[i]] = SA[i] − 1
C: $: 0, a: 1, c: 4, g: 6
i                1  2  3  4  5  6  7  8  9
BW               g  $  c  c  a  g  g  a  a
sorted letters   $  a  a  a  c  c  g  g  g
Example: LF[5] = C[a] + rank_a(BW, 5) = 1 + 1 = 2
SA[5] = 2, so SA[2] = 2 − 1 = 1

34 Time and Space with the Wavelet Tree
If BW is stored using the wavelet tree, rank can be computed in O(log σ / log log n) time
Pattern search takes O(|P| log σ / log log n) time
Size of BW: nH_0(BW) + O(n log σ / log log n) bits
If the indexes for lookup/inverse store every d-th suffix
–Size: O(n log n / d) bits
–Time: O(d log σ / log log n)
To make the index size o(n), set d = log^{1+ε} n

35 Entropy of String
Definition: order-0 entropy H_0 of string S
  H_0(S) = Σ_c p_c log(1/p_c)
  (p_c: probability of appearance of letter c)
Definition: order-k entropy
  H_k(S) = Σ_s (n_s/n) Σ_c p_{s,c} log(1/p_{s,c})
–assumption: Pr[S[i] = c] is determined from S[i−k..i−1] (the context)
–n_s: the number of letters whose context is s
–p_{s,c}: probability of c appearing in context s
(Figure: the string abcababc with a context marked)
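
Both definitions can be evaluated empirically on a concrete string. The sketch below takes probabilities as observed frequencies and, as an assumption of this example, ignores the first k letters, which have no full context. The function names `h0` and `hk` are illustrative.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Empirical order-0 entropy of s, in bits per letter."""
    n = len(s)
    return sum(c / n * log2(n / c) for c in Counter(s).values())


def hk(s, k):
    """Empirical order-k entropy with the preceding k letters as context."""
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])   # letter s[i] seen in context s[i-k..i-1]
    # Weighted sum over contexts: n_s * H_0(letters seen in context s), over n.
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


S = "abcababc"
print(h0(S))      # order-0 entropy of the example string
print(hk(S, 1))   # smaller: the preceding letter narrows down the next one
```

For "abab", h0 is exactly 1 bit per letter while hk with k = 1 is 0, since each letter is fully determined by its predecessor.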

36 Higher-Order Compression of Strings
In the string after BWT, characters with the same context are gathered together
Compress the substring for each context into H_0 ⇒ achieve H_k in total
BW  following suffix  context
g   $                 $
$   acagcagg$         a
c   agcagg$           a
c   agg$              a
a   cagcagg$          c
g   cagg$             c
g   g$                g
a   gcagg$            g
a   gg$               g

37 Summary: FM-index
Assume σ = polylog(n)
Index size: nH_k(S) + o(n) bits
Pattern search: O(|P|) time
lookup/inverse: O(log^{1+ε} n) time
Decoding a substring of length l: O(l + log^{1+ε} n) time
