An Online Algorithm for Finding the Longest Previous Factors
Daisuke Okanohara, University of Tokyo
Kunihiko Sadakane, Kyushu University
ESA 2008 @ Universität Karlsruhe, Sep 15, 2008

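Problem: Finding the longest previous factors (matching)
Input: a text T[0…n-1]
At every position k, report the longest substring T[k-len+1…k] that also occurs at a previous position (the history), i.e. T[pos…pos+len-1] = T[k-len+1…k]
–c.f. LZ77, LZ-factorization
Example: after reading the last a of abracadabra, (pos, len) = (0, 4), since abra already occurs at position 0
Example: after then reading d (abracadabrad), (pos, len) = (5, 2), since ad already occurs at position 5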
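The problem statement can be made concrete with a naive reference implementation. This is only a sketch for checking answers on small inputs, not the method of the talk; the function name `longest_previous_factor` is made up here, and it assumes that an earlier occurrence only has to end before position k (so it may overlap the current one).

```python
def longest_previous_factor(text, k):
    """Return (pos, length) for position k, or (-1, 0) if text[k] never occurred before."""
    best_pos, best_len = -1, 0
    for end in range(k):  # candidate end position of an earlier occurrence
        length = 0
        # Extend the match backwards from text[end] and text[k].
        while length <= end and text[end - length] == text[k - length]:
            length += 1
        if length > best_len:
            best_pos, best_len = end - length + 1, length
    return best_pos, best_len


print(longest_previous_factor("abracadabra", 10))   # (0, 4): "abra"
print(longest_previous_factor("abracadabrad", 11))  # (5, 2): "ad"
```

Each query scans the whole history, so this is far slower than the index-based approaches discussed next.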

Applications
Data compression
–LZ77, Prediction by Partial Matching
Pattern analysis
–Log analysis
Data mining

Previous approaches
Sequential search on the fly
–O(n^2) time for a text of length n
Offline index approach
–Read the whole text beforehand, and build an index (suffix array/tree) for it
–Search for the match using the index [Chen 07] [Chen 08] [Crochemore 08] [Kolpakov 01] [Larsson 99]
–6n bytes and O(n log n) time [Chen 08]: suffix arrays with range minimum query

New problem: finding the longest previous factors online
Report the match information just after reading each character
–This covers the case where the length of the data is not known beforehand, e.g. streaming data
Previous approaches cannot deal with this problem

Our approach for the new problem
Online construction of enhanced prefix arrays
–Update the index just after reading each character
–Many methods used in LZ77 cannot report the longest match; our method can
Succinct data structures
–Keep all information very compactly, using about the same space as the original text

Prefix arrays
We keep prefix arrays (PA), NOT suffix arrays (SA)
–because when a character is appended to the end of a text, SA may undergo Θ(n) changes, but PA does not
–In PA, prefixes are sorted in reverse-lexicographic order

T = aaaa, T_new = aaaaz

SA for T    SA for T_new    PA for T    PA for T_new
0 $         0 $             0 $         0 $
1 a$        4 aaaaz$        1 $a        1 $a
2 aa$       3 aaaz$         2 $aa       2 $aa
3 aaa$      2 aaz$          3 $aaa      3 $aaa
4 aaaa$     1 az$           4 $aaaa     4 $aaaa
            5 z$                        5 $aaaaz
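The aaaa / aaaaz example can be reproduced with a few lines of naive sorting. This is a sketch only: `suffix_order` and `prefix_order` are names made up here, suffixes are identified by their start position and prefixes by their end position, and $ is assumed to sort before every other character (true for ASCII letters).

```python
def suffix_order(text):
    """Start positions of the suffixes of text + '$' in lexicographic order."""
    t = text + "$"
    return sorted(range(len(t)), key=lambda i: t[i:])


def prefix_order(text):
    """End positions of the prefixes of '$' + text, compared from their last character backwards."""
    t = "$" + text
    return sorted(range(len(t)), key=lambda e: t[e::-1])


print(suffix_order("aaaa"), suffix_order("aaaaz"))
print(prefix_order("aaaa"), prefix_order("aaaaz"))
```

Appending z reverses the relative order of all existing suffixes, while the existing prefixes keep their order and the new prefix is simply inserted.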

Our idea
Weiner's suffix tree construction algorithm
–It inserts the suffixes starting from the shortest one
–We modify it to insert the prefixes starting from the shortest one
–A similar idea is used for the incremental construction of compressed suffix arrays [Chan et al. 2007], [Lippert 2005]
We extend this work to a succinct version
–Our algorithm reports the matching information as a by-product of the construction
–It does not require a tree representation; it uses only array information

Preliminary: Dynamic Rank/Select Dictionary (DRSD)
For a text T[0…n-1], a DRSD supports:
–rank(T, c, i): return the number of occurrences of c in T[0…i]
–select(T, c, i): return the position of the i-th c in T
–insert(T, c, i): insert c at T[i]
–delete(T, i): delete T[i]
These operations can be supported within the time bound of [Lee et al. 07] (O(log n) time if σ < log n), using n log σ + o(n log σ) bits of space, where σ is the alphabet size
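The four operations can be illustrated with a plain Python list. This is a sketch of the interface only: the class name `NaiveDRSD` is made up here, every operation takes linear time instead of the succinct bounds cited above, and select is assumed to count occurrences from 0.

```python
class NaiveDRSD:
    """List-backed stand-in for a dynamic rank/select dictionary (linear time per operation)."""

    def __init__(self, seq=""):
        self.seq = list(seq)

    def rank(self, c, i):
        """Number of occurrences of c in seq[0..i] (inclusive)."""
        return self.seq[: i + 1].count(c)

    def select(self, c, i):
        """Position of the i-th c, counting from 0; None if there are fewer."""
        seen = -1
        for pos, x in enumerate(self.seq):
            if x == c:
                seen += 1
                if seen == i:
                    return pos
        return None

    def insert(self, c, i):
        self.seq.insert(i, c)

    def delete(self, i):
        del self.seq[i]


d = NaiveDRSD("abba$baaa")
print(d.rank("a", 3), d.select("a", 2))
```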

Preliminary: Range Minimum Query (RMQ)
Given an array E[0…n-1] of elements from a totally ordered set, rmq(E, l, r) returns the index of the smallest element in E[l…r]
–i.e. rmq(E, l, r) = argmin_{k ∈ [l, r]} E[k]
–in case of a tie, the leftmost such index is returned
In the static case, RMQ can be supported in O(1) time using 2n+o(n) bits of space [Fischer, 2007]
In the dynamic case, RMQ/insert/delete can be supported in O(T log n) time using O(n) bits if the cost of a lookup E[i] is O(T)
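A linear-scan version pins down the semantics of rmq, including the leftmost tie-break. It is only a sketch of the query's meaning, not the O(1) or O(T log n) structures cited above; the sample array is the H column used in the running example of the following slides.

```python
def rmq(E, l, r):
    """Index of the smallest element in E[l..r] (inclusive); leftmost index on ties."""
    best = l
    for k in range(l + 1, r + 1):
        if E[k] < E[best]:  # strict comparison keeps the leftmost minimum
            best = k
    return best


H = [0, 1, 1, 3, 3, 0, 2, 2, 0]
print(rmq(H, 4, 6), rmq(H, 3, 3), rmq(H, 1, 2))
```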

Data structures
Keep the following data structures for T[0…k]
–Assume T[0] = $, where $ is the unique smallest character
B[0…k]: (prefix-) BW-transformed text
–B[i] = T[PA[i]+1], and B[i] = $ if PA[i] = k
H[0…k]: height array
–explained on a later slide
C[0…σ-1]: cumulative array
–C[c] = the total number of characters c' in T such that c' < c
s: the position of the next prefix to be inserted

T = $abaababa

i  PA  prefix     B  H
0  0   $          a  0
1  1   $a         b  1
2  4   $abaa      b  1
3  3   $aba       a  3
4  8   $abaababa  $  3
5  6   $abaaba    b  0
6  2   $ab        a  2
7  5   $abaab     a  2
8  7   $abaabab   a  0

PA stores the end position of each prefix (we will omit this column from now on)
prefix lists the prefixes sorted in reverse-lexicographic order
(Neither PA nor prefix is stored explicitly)
PA[i] can be obtained with an SA lookup operation in O(log^2 n) time, as in the FM-index [Ferragina 00]

B stores the next character for each prefix (the Burrows-Wheeler transform for prefix arrays)

H stores the length of the longest common suffix between adjacent prefixes

s = 4
s denotes the position of $ in B, which is where the longest prefix is placed

C[c] = the number of characters c' in T (= B) that are smaller than c
For T = $abaababa: C[$] = 0, C[a] = 1, C[b] = 6
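The C values for T = $abaababa can be recomputed directly from character counts. A minimal sketch; the function name `cumulative_counts` is made up here, and $ is assumed to sort before the letters.

```python
from collections import Counter


def cumulative_counts(text):
    """Map each character c to the number of characters in text that are smaller than c."""
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return C


print(cumulative_counts("$abaababa"))
```

The text contains one $, five a and three b, which gives C[a] = 1 and C[b] = 6.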

T = $abaababaa
The next character `a' arrives!

Replace $ in B[s] with a (because $ is placed at the position of the longest prefix)

Find the position for the new prefix $abaababaa
Count the number of a in B[0…s-1]: rank(B, a, s-1) = 2

Insert $abaababaa at position 3: C[a] + rank(B, a, s-1) = 1 + 2 = 3
s := C[a] + rank(B, a, s-1), insert(B, $, s)

T = $abaababaa

i  prefix      B  H
0  $           a  0
1  $a          b  1
2  $abaa       b  1
   $abaababaa  $
3  $aba        a  3
4  $abaababa   a  3
5  $abaaba     b  0
6  $ab         a  2
7  $abaab      a  2
8  $abaabab    a  0
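This position update can be traced on the B column exactly as it appears on the slides (B = abba$baaa, s = 4, C[a] = 1). The sketch below uses a made-up helper name, `next_position`, and a plain list in place of the dynamic rank/select dictionary.

```python
def next_position(B, s, c, C):
    """Overwrite the $ at B[s] with c and return C[c] + rank(B, c, s-1)."""
    B[s] = c
    return C[c] + B[:s].count(c)  # B[:s] covers B[0..s-1]


B = list("abba$baaa")            # B column for T = $abaababa, as on the slides
C = {"$": 0, "a": 1, "b": 6}
s = next_position(B, 4, "a", C)  # 1 + 2 = 3
B.insert(s, "$")                 # the new longest prefix now owns the $
print(s, "".join(B))
```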

Update H
This is actually the length of the longest match in the history

T = $abaababa
Recall that the neighbors of the new prefix, $abaa and $aba, come from prefixes whose B was `a' in the previous step ($aba and $ab)
These positions can be found by using rank and select
c.f. succ(T, c, s) = select(T, c, rank(T, c, s))
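The succ identity, and a matching pred, can be written out on a plain list. This is a sketch under two assumptions that the slides do not spell out: rank counts T[0…s] inclusive, and select counts occurrences from 0. The `pred` formula is a counterpart derived here, not taken from the slides.

```python
def rank(T, c, i):
    """Number of occurrences of c in T[0..i] (inclusive)."""
    return T[: i + 1].count(c)


def select(T, c, i):
    """Position of the i-th c, counting from 0; None if there are fewer."""
    positions = [p for p, x in enumerate(T) if x == c]
    return positions[i] if 0 <= i < len(positions) else None


def succ(T, c, s):
    """Nearest position after s holding c."""
    return select(T, c, rank(T, c, s))


def pred(T, c, s):
    """Nearest position before s holding c."""
    return select(T, c, rank(T, c, s - 1) - 1)


B = list("abba$baaa")  # B column for T = $abaababa, as on the slides
print(pred(B, "a", 4), succ(B, "a", 4))
```

For s = 4 this yields rows 3 and 6, the prefixes $aba and $ab.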

rmq(H, 4, 6) = 5, H[5] = 0
Therefore H[rmq(H, 4, 6)] + 1 = 1 is the new value for the next H entry

rmq(H, 3, 3) = 3, H[3] = 3
Therefore H[rmq(H, 3, 3)] + 1 = 4 is the new value for the next H entry

T = $abaababaa

i  prefix      B  H
0  $           a  0
1  $a          b  1
2  $abaa       b  4   = H[rmq(H, 3, 3)] + 1
   $abaababaa  $  1   = H[rmq(H, 4, 6)] + 1
3  $aba        a  3
4  $abaababa   a  3
5  $abaaba     b  0
6  $ab         a  2
7  $abaab      a  2
8  $abaabab    a  0

Report max(4, 1) = 4 as the length of the longest factor, and report the position of $abaa as SA lookup[2] - len = 4 - 4 = 0
Report (pos = 0, len = 4) as the maximal match

Overall algorithm
All operations are rank, select, and RMQ
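Putting the steps together, one possible end-to-end sketch looks as follows. It is not the implementation of the talk: B, H and PA are plain Python lists, PA is stored explicitly instead of being decoded by SA lookup, every rank, select and range minimum is a linear scan, and `online_lpf` is a made-up name. It assumes H[i] holds the common-suffix length of rows i and i+1, that the input contains no $, and that every input character sorts after $.

```python
def online_lpf(text):
    """Yield (pos, length) after each character; (-1, 0) means no previous match."""
    B, H, PA = ["$"], [0], [0]  # state for T = "$"; H[i] compares rows i and i+1
    counts = {"$": 1}           # character counts, used to derive C[c]
    s = 0                       # row of the longest prefix (where B holds "$")
    for k, c in enumerate(text, start=1):
        B[s] = c
        # Nearest rows above and below s whose B is also c (rank/select in the slides).
        pred = max((i for i in range(s) if B[i] == c), default=None)
        succ = min((i for i in range(s + 1, len(B)) if B[i] == c), default=None)
        # Range minimum over H, plus one for the appended character c.
        lcs_pred = 1 + min(H[pred:s]) if pred is not None else 0
        lcs_succ = 1 + min(H[s:succ]) if succ is not None else 0
        # Row of the new prefix: C[c] + rank(B, c, s-1).
        new_s = sum(n for ch, n in counts.items() if ch < c) + B[:s].count(c)
        if lcs_pred >= lcs_succ:
            length, row = lcs_pred, pred
        else:
            length, row = lcs_succ, succ
        # PA holds end positions in "$" + text; convert to a start position in text.
        pos = PA[row] + 1 - length if length > 0 else -1
        H[new_s - 1] = lcs_pred
        H.insert(new_s, lcs_succ)
        B.insert(new_s, "$")
        PA.insert(new_s, k)
        counts[c] = counts.get(c, 0) + 1
        s = new_s
        yield pos, length


for ch, match in zip("abracadabra", online_lpf("abracadabra")):
    print(ch, match)
```

The generator reports a match right after each character is consumed, which is the online behaviour the talk is after; swapping the lists for a dynamic rank/select dictionary and a dynamic RMQ structure is what gives the stated bounds.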

Overall analysis
H is stored in 2n bits [Sadakane, SODA 02]
–a naïve representation requires O(n log n) bits
–decoding an entry requires one SA lookup operation
B is stored in n log σ + o(n log σ) bits
–by using a dynamic rank/select dictionary
The bottleneck of our algorithm is rmq(H, l, r), which requires O(log^3 n) time
–an SA lookup requires O(log^2 n) time

Overall analysis (cont.)
We can solve the online longest previous factor problem in O(log^3 n) time per character, using n log_2 σ + o(n log σ) + O(n) bits of space
–where σ is the alphabet size and n is the length of the text

Simulating a window buffer
If the working space is limited, we often discard the history starting from the oldest characters
We can simulate this by using almost the same operations as in the insertion operation
We do not actually discard a character but ignore it
–If we actually discarded the oldest character, it might cause Θ(n) changes in B and H
–The effect of the discarded characters remains (prefixes are still sorted according to the discarded characters)
–But this does not cause a problem if we only report the matching information up to the history size

Experiments
In the experiments, we used a simpler data structure (the algorithm is the same)
–B and H are stored in a balanced binary tree
–Each leaf stores a small block of B and H
–We call this implementation OS
We compare OS with other, offline algorithms
–These require reading the whole text beforehand
–CPSa, CPSd: SA+LCP with stack [Chen et al. 07]
–CPS6n: SA with RMQ [Chen et al. 08]
–kk-lz: mreps, specialized for σ=4 [Kolpakov 01]

Peak memory usage in bytes per input symbol
The space of OS is the smallest on many real data sets, especially when the values in H are small

Runtime in milliseconds for searching the longest previous factors
OS is about 2 to 10 times slower than the fastest ones due to the dynamic operations

Conclusion
We solve the online longest matching problem by using enhanced prefix arrays
–Simple and easy to implement
–Requires about 3 to 6 times the space of the input text
–This is actually a by-product of the construction of compressed suffix trees, c.f. Weiner's algorithm
Simple, with much room for improvement
–by using better rank/select/rmq implementations

Future work
Construction of compressed suffix trees
–Update the parenthesis tree efficiently
–Actually, the time complexity for this is smaller
Practical improvements
–Currently, dynamic succinct data structures are not efficient due to cache misses and memory fragmentation
–An approximate version of the longest matching problem is enough for many applications
Thank you for your attention!

Weiner's suffix tree construction algorithm
[Figure: the tree after each insertion, for the prefixes $a through $abracadabr]
