Computing longest common substring and all palindromes from compressed strings
Wataru Matsubara (1), Shunsuke Inenaga (2), Akira Ishino (1), Ayumi Shinohara (1), Tomoyuki Nakamura (1), Kazuo Hashimoto (1)
(1) Graduate School of Information Sciences, Tohoku University, Japan
(2) Department of Computer Science and Communication Engineering, Kyushu University, Japan
Background and motivations
What is a compressed string algorithm?
Input text: A palindrome is a symmetric string. It is interesting on its own as a word puzzle. For example, "I prefer pi", "Borrow or rob?", "Was it a bar or a bat I saw?", and so on.
Task: find palindromes.
Output: mm, isi, zz, ipreferpi, borroworrob, wasitabarorabatisaw, oo, ...
Compressed text: e)%eARY)(ReJD)OIHOIFEnkkdi we02kfo)J”LPEPJ9wEOW*# eO …
One solution would be to decompress the compressed text and search the decompressed text. However, the decompressed size can be exponentially large with respect to the compressed size.
Goal of algorithms for compressed strings
Process the compressed text without decompression.
Processing time should be polynomial in n, where n is the size of the compressed text.
– The decompressed size can be exponentially large with respect to n.
Compression schemes
– run-length encoding
– Lempel-Ziv
– grammar-based compression
All of these can be modeled by a Straight Line Program (SLP): the resulting archive of most practical compression methods can be transformed into an SLP generating the same original text [Rytter2003].
Definition of Straight Line Program (SLP)
An SLP T is a sequence of assignments X_1 = expr_1; X_2 = expr_2; …; X_n = expr_n, where each X_k is a variable and each expr_k is either a single character a (a ∈ Σ) or a concatenation X_i X_j of two earlier variables (i, j < k).
Equivalently, an SLP T for a string w is a CFG in Chomsky normal form such that L(T) = {w}.
Straight Line Program (SLP) example
X_1 = a; X_2 = b; X_3 = X_1 X_2; X_4 = X_3 X_1; X_5 = X_3 X_4; X_6 = X_5 X_5; X_7 = X_4 X_6; X_8 = X_7 X_5
Here n is the size of the SLP T and N is the length of the text it generates, with N = O(2^n).
[Figure: derivation tree of X_8, whose root has children X_7 and X_5.]
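As an illustration that is not part of the talk, the eight rules of the example can be stored in a small Python dictionary. The sketch below computes the length of every variable and reads a single character without decompressing the text. The names RULES, lengths, char_at and expand are assumptions made for this sketch, and the second SLP with 60 doubling rules is a hypothetical example of N growing exponentially in n.

```python
# Each rule is either a single character or a pair (i, j) meaning X_i X_j.
RULES = {
    1: "a",
    2: "b",
    3: (1, 2),
    4: (3, 1),
    5: (3, 4),
    6: (5, 5),
    7: (4, 6),
    8: (7, 5),
}


def lengths(rules):
    """Length of the string derived by every variable, using one addition per rule."""
    length = {}
    for k in sorted(rules):  # rules only refer to smaller indices
        rhs = rules[k]
        length[k] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length


def char_at(rules, length, k, pos):
    """Character at 0-based position pos of X_k, found by walking down the derivation tree."""
    while not isinstance(rules[k], str):
        left, right = rules[k]
        if pos < length[left]:
            k = left
        else:
            pos -= length[left]
            k = right
    return rules[k]


def expand(rules, k):
    """Full decompression of X_k; only viable for small examples."""
    rhs = rules[k]
    return rhs if isinstance(rhs, str) else expand(rules, rhs[0]) + expand(rules, rhs[1])


LEN = lengths(RULES)
print(LEN[8])                     # 18
print(expand(RULES, 8))           # abaababaababaababa
print(char_at(RULES, LEN, 8, 3))  # a

# Hypothetical SLP with 60 rules: X_1 = a and X_k = X_{k-1} X_{k-1}.
DOUBLING = {1: "a", **{k: (k - 1, k - 1) for k in range(2, 61)}}
print(lengths(DOUBLING)[60] == 2 ** 59)  # True
```

The length table and the single-character lookup never build the text, so they remain cheap even for the doubling SLP, whose text is far too long to decompress.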
Efficient algorithms for compressed strings
substring matching
– Karpinski et al (1996): O(n^4 log n) time
– Miyazaki et al (1997): O(n^4) time
– Lifshits (2006): O(n^3) time
minimum period
– Karpinski et al (1996): O(n^4 log n) time
– Lifshits (2006): O(n^3 log N) time
all squares
– Gasieniec et al (1994): O(n^6 log^5 N) time
Hardness results
Subsequence pattern matching
– Lifshits and Lohrey (2006): NP-hard
Longest common subsequence
– Lifshits and Lohrey (2006): NP-hard
Hamming distance
– Lifshits (2007): #P-complete
Is there any reasonable comparison measure for compressed strings?
String comparison measures
– Hamming distance: O(N) on uncompressed text; #P-complete on compressed text [Lifshits 07]
– Longest common subsequence: O(N^2 / log N) on uncompressed text; NP-hard on compressed text [Lifshits and Lohrey 06]
– Longest common substring: O(N) on uncompressed text; ?? on compressed text (we solve this problem)
Our results
Our Result 1: Longest Common Substring
Problem: Given two SLPs that describe texts T and S respectively, compute LCStr(T, S).
LCStr(T, S): the length of the longest common substring of T and S
n: the total size of the input SLPs
Theorem: LCStr(T, S) can be computed in O(n^4 log n) time using O(n^3) space.
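To pin down the quantity LCStr(T, S), here is a minimal Python baseline that works on explicit (already decompressed) strings. It is a plain quadratic dynamic program chosen for brevity. It is neither the linear-time method for uncompressed text nor the compressed algorithm of the theorem, and the function name lcstr is an assumption of this sketch.

```python
def lcstr(s, t):
    """Length of the longest common substring of s and t (O(|s| * |t|) time)."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # cur[j] = length of the longest common suffix of s[:i] and t[:j]
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


print(lcstr("aabaaba", "abaababb"))  # 6, witnessed by "abaaba"
```

Running such a baseline on decompressed texts takes time that depends on N, which is exactly what the compressed algorithm avoids.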
Our Result 2: Palindromes
Problem: Given an SLP T, compute a compressed representation of the set of all palindromes of T.
n: the size of SLP T
N: the length of the original text T (note that N = O(2^n))
Previous best result: O(n^5 log^4 N) time [Gasieniec et al 1996]
Theorem: The problem can be solved in O(n^4) time using O(n^2) space.
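For intuition about the output, the following Python sketch lists the maximal palindrome around every center of an explicit string by naive expansion. It assumes that one maximal palindrome per center is an adequate way to describe all palindromes of a text. The function name maximal_palindromes is hypothetical, and the sketch works on decompressed text, so it is not the O(n^4) algorithm of the theorem.

```python
def maximal_palindromes(text):
    """Return (start, end) pairs, end exclusive, of the non-empty maximal
    palindrome at each of the 2 * len(text) - 1 centers."""
    result = []
    n = len(text)
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2  # odd centers lie between two characters
        while left >= 0 and right < n and text[left] == text[right]:
            left -= 1
            right += 1
        # The palindrome found is text[left + 1:right]; skip empty ones.
        if right - left - 1 > 0:
            result.append((left + 1, right))
    return result


pals = maximal_palindromes("borroworrob")
print(max(pals, key=lambda p: p[1] - p[0]))  # (0, 11)
```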
Details of our algorithm
– Computing longest common substring
– Computing palindromes (omitted in this talk)
Property of common substrings (1/3)
For each common substring Z of strings S and T, there always exist variables X_i = X_l X_r and Y_j = Y_L Y_R such that:
– Z is a common substring of X_i and Y_j
– Z contains an overlap between X_l and Y_R
[Figure: common substring Z inside X_i = X_l X_r and Y_j = Y_L Y_R, with the overlap w marked.]
Property of common substrings (2/3)
For each common substring Z of strings S and T, there always exists a string w such that:
– w is a substring of Z
– w is an overlap of variables of the SLPs for S and T
Property of common substrings (3/3)
For each common substring Z of strings S and T, there always exists a string w such that:
– Z can be calculated by extending w
[Figure: extending the overlap w to the common substring Z.]
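Overlaps
For any strings X and Y, OL(X, Y) is the set of the lengths of overlaps of X and Y, that is, the lengths k > 0 for which the length-k suffix of X equals the length-k prefix of Y.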
Overlaps example
OL("aabaaba", "abaababb") = {1, 3, 6}
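The example can be checked with a naive Python sketch on explicit strings. It assumes that an overlap of length k means the length-k suffix of the first string equals the length-k prefix of the second, which is the reading that reproduces {1, 3, 6}. The function name overlaps is hypothetical.

```python
def overlaps(x, y):
    """Lengths k >= 1 such that the length-k suffix of x equals the length-k prefix of y."""
    return [
        k
        for k in range(1, min(len(x), len(y)) + 1)
        if x[len(x) - k:] == y[:k]
    ]


print(overlaps("aabaaba", "abaababb"))  # [1, 3, 6]
```

This version compares explicit strings, so it only illustrates the definition and does not work on SLP variables directly.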
Computing overlaps [Karpinski et al 1996]
Lemma: For any variables X_i and X_j of an SLP T, OL(X_i, X_j) can be represented by O(n) arithmetic progressions.
Theorem: For any SLP T, OL(X_i, X_j) can be computed for all pairs of variables X_i and X_j in a total of O(n^4 log n) time and O(n^3) space.
How to extend overlaps
Example: aba ∈ OL(X_l, Y_R), where X_i = X_l X_r and Y_j = Y_L Y_R.
Align X_i and Y_j on the overlap aba and compare the neighboring characters one by one: the common substring grows while the characters match and stops at the first mismatch.
However, we are not allowed to process character by character, because the decompressed strings can be exponentially long.
[Figure: X_i and Y_j aligned on the overlap, with the compared positions labeled match or mismatch.]
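The extension step can be illustrated on explicit strings with a short Python sketch. It deliberately works character by character, which is the very thing the compressed algorithm must avoid, so it only shows what is being computed. The function name extend_overlap and the sample strings for X_r and Y_L are assumptions of this sketch.

```python
def extend_overlap(x_l, x_r, y_l, y_r, k):
    """Extend an overlap of length k between x_l and y_r to the longest common
    substring of x_l + x_r and y_l + y_r that contains it."""
    if not 0 < k <= min(len(x_l), len(y_r)) or x_l[len(x_l) - k:] != y_r[:k]:
        raise ValueError("k is not an overlap length of x_l and y_r")
    x, y = x_l + x_r, y_l + y_r
    x_start, y_start = len(x_l) - k, len(y_l)  # where the overlap begins
    x_end, y_end = x_start + k, y_start + k    # one past where it ends

    left = 0  # matching characters to the left of the overlap
    while (x_start - left > 0 and y_start - left > 0
           and x[x_start - left - 1] == y[y_start - left - 1]):
        left += 1

    right = 0  # matching characters to the right of the overlap
    while (x_end + right < len(x) and y_end + right < len(y)
           and x[x_end + right] == y[y_end + right]):
        right += 1

    return x[x_start - left:x_end + right]


# Overlap "aba" (k = 3) between X_l = "aabaaba" and Y_R = "abaababb".
print(extend_overlap("aabaaba", "abb", "cba", "abaababb", 3))  # baabaab
```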
First-mismatch function [Karpinski et al 1996]
input: SLP variables X_i and Y_j, integer k
output: position p of the first mismatch when X_i is aligned with Y_j at offset k
[Figure: X_i placed against Y_j at offset k, with the first mismatching position p marked.]
First-mismatch function [Karpinski et al 1996]
Lemma: Provided that the sets of overlaps are already computed, FM(X_i, Y_j, k) can be computed in O(n log n) time.
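What the first-mismatch function returns can be shown with a naive Python sketch on explicit strings. The indexing convention here is an assumption of this sketch, not necessarily that of Karpinski et al: positions are 0-based, x is compared against y starting at offset k, and if no mismatch occurs the length of the compared region is returned. The sketch runs in time linear in the string length and does not achieve the O(n log n) bound of the lemma.

```python
def first_mismatch(x, y, k):
    """Smallest p with x[p] != y[k + p] (0-based), or the number of compared
    positions if x and y agree wherever they are aligned."""
    compared = max(0, min(len(x), len(y) - k))
    for p in range(compared):
        if x[p] != y[k + p]:
            return p
    return compared


print(first_mismatch("abab", "aaabaa", 2))  # 3
```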
Extending overlaps using the FM function
Lemma: Extending overlaps can be done with O(n) calls of the FM function.
Computing longest common substring
[Figure: pseudo-code, annotated with the costs below.]
– O(n^2) items
– O(n) calls of the FM function per item
– O(n log n) time per call
In total, LCStr(S, T) can be computed in O(n^2 × n × n log n) = O(n^4 log n) time.
Conclusions
Computing longest common substring from compressed strings
– O(n^4 log n) time and O(n^3) space
Computing all palindromes from compressed strings
– O(n^4) time and O(n^2) space
Thank you for your attention.