
Computing longest common substring and all palindromes from compressed strings Wataru Matsubara 1, Shunsuke Inenaga 2, Akira Ishino 1, Ayumi Shinohara.


1 Computing longest common substring and all palindromes from compressed strings
Wataru Matsubara (1), Shunsuke Inenaga (2), Akira Ishino (1), Ayumi Shinohara (1), Tomoyuki Nakamura (1), Kazuo Hashimoto (1)
(1) Graduate School of Information Sciences, Tohoku University, Japan
(2) Department of Computer Science and Communication Engineering, Kyushu University, Japan

2 Background and motivations

3 What is a compressed string algorithm?
Input text: A palindrome is a symmetric string. It is interesting on their own as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on.

4 What is a compressed string algorithm?
Input text: A palindrome is a symmetric string. It is interesting on their own as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on.
Find palindromes.
Output: mm, isi, zz, ipreferpi, borroworrob, wasitabarorabatisaw, oo, ...
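The "find palindromes" step on an uncompressed text can be sketched as follows. This is only a naive baseline for illustration, not the method of the talk: the function names `normalize` and `maximal_palindromes` and the `min_len` parameter are assumptions, and the running time is quadratic in the text length in the worst case.

```python
def normalize(text):
    """Keep letters only, lower-cased, as in the slide's output strings."""
    return "".join(ch.lower() for ch in text if ch.isalpha())


def maximal_palindromes(s, min_len=2):
    """Return the longest palindrome around each of the 2*len(s)-1 centers."""
    found = []
    for center in range(2 * len(s) - 1):
        left = center // 2
        right = left + center % 2  # equal to left for odd-length palindromes
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        # s[left+1:right] is now the maximal palindrome at this center.
        if right - left - 1 >= min_len:
            found.append(s[left + 1:right])
    return found
```

For example, `maximal_palindromes(normalize("I prefer pi"))` yields the single string `ipreferpi`.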

5 What is a compressed string algorithm?
Compressed text: e)%eARY)(ReJD)OIHOIFEnkkdi we02kfo)J”LPEPJ9wEOW*# eO …
Decompress to obtain the decompressed text: A palindrome is a symmetric string. It is interesting on their own as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on.
Find palindromes.
Output: mm, isi, zz, ipreferpi, borroworrob, wasitabarorabatisaw, oo, ...
One solution would be to decompress the compressed text. However, the decompressed size can be exponentially large with respect to the compressed size.

6 Goal of algorithms for compressed strings
– Process the compressed text without decompression.
– Processing time should be polynomial in n, where n is the size of the compressed text.
– The decompressed size can be exponentially large with respect to n.

7 Compression schemes
– run-length encoding
– Lempel-Ziv
– grammar-based compression: Straight Line Program (SLP)
[Rytter2003] The resulting archive of most practical compression methods can be transformed into an SLP generating the same original text.

8 Definition of Straight Line Program (SLP)
An SLP T is a sequence of assignments X_1 = expr_1; X_2 = expr_2; ...; X_n = expr_n, where each X_k is a variable and each expr_k is either a single character a (a ∈ Σ) or X_i X_j (i, j < k).
An SLP T for a string w is a CFG in Chomsky normal form such that L(T) = {w}.

9 Straight Line Program (SLP) Example
SLP T:
X_1 = a
X_2 = b
X_3 = X_1 X_2
X_4 = X_3 X_1
X_5 = X_3 X_4
X_6 = X_5 X_5
X_7 = X_4 X_6
X_8 = X_7 X_5
n: the size of the SLP; N: the length of the text it generates; N = O(2^n).

10 Straight Line Program (SLP) Example (continued)
[Figure: derivation tree of the same SLP, with X_8 at the root and X_7 and X_5 as its children.]
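The example SLP can be checked with a short sketch. The dictionary encoding of the rules and the helper names `lengths`, `expand` and `doubling_slp` are assumptions made for this illustration. `lengths` works on the rules alone, while `expand` decompresses and is therefore only safe for small examples.

```python
# The slide's SLP: each rule is a terminal character or a pair of earlier variables.
RULES = {
    1: "a",
    2: "b",
    3: (1, 2),
    4: (3, 1),
    5: (3, 4),
    6: (5, 5),
    7: (4, 6),
    8: (7, 5),
}


def lengths(rules):
    """Length of every variable's expansion, computed without decompressing."""
    out = {}
    for k in sorted(rules):
        rhs = rules[k]
        out[k] = 1 if isinstance(rhs, str) else out[rhs[0]] + out[rhs[1]]
    return out


def expand(rules, k):
    """Fully decompress variable k (the result can be exponentially long)."""
    rhs = rules[k]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


def doubling_slp(n):
    """X_1 = a and X_k = X_{k-1} X_{k-1}: n rules describing 2^(n-1) characters."""
    rules = {1: "a"}
    for k in range(2, n + 1):
        rules[k] = (k - 1, k - 1)
    return rules
```

Here `expand(RULES, 8)` gives `abaababaababaababa` (18 characters), and `lengths(doubling_slp(50))[50]` equals 2^49, which shows why decompressing first is not an option in general.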

11 Efficient algorithms for compressed strings
Substring matching
– Karpinski et al. (1996): O(n^4 log n) time
– Miyazaki et al. (1997): O(n^4) time
– Lifshits (2006): O(n^3) time
Minimum period
– Karpinski et al. (1996): O(n^4 log n) time
– Lifshits (2006): O(n^3 log N) time
All squares
– Gasieniec et al. (1994): O(n^6 log^5 N) time

12 Hardness results
Subsequence pattern matching
– Lifshits and Lohrey (2006): NP-hard
Longest common subsequence
– Lifshits and Lohrey (2006): NP-hard
Hamming distance
– Lifshits (2007): #P-complete
Is there any reasonable comparison measure for compressed strings?

13 String comparison measures

Measure                      Uncompressed text   Compressed text
Hamming distance             O(N)                #P-complete [Lifshits 07]
Longest common subsequence   O(N^2 / log N)      NP-hard [Lifshits and Lohrey 06]
Longest common substring     O(N)                ?? (we solve this problem)

[Figure: each measure illustrated on example strings over {a, b}.]

14 Our results

15 Our Result 1: Longest Common Substring
Problem: Given two SLPs T and S that are descriptions of texts T and S respectively, compute LCStr(T, S).
– LCStr(T, S): the length of the longest common substring of T and S
– n: the total size of the input SLPs
Theorem: LCStr(T, S) can be computed in O(n^4 log n) time using O(n^3) space.
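For reference, the quantity LCStr(T, S) can be computed on decompressed texts with a textbook dynamic program. This is a sketch of a baseline only, not the compressed algorithm of the theorem: the name `lcstr_naive` is an assumption, and the running time is O(|T| × |S|) on the uncompressed strings.

```python
def lcstr_naive(t, s):
    """Length of the longest common substring of t and s."""
    best = 0
    # prev[j] = length of the longest common suffix of t[:i-1] and s[:j]
    prev = [0] * (len(s) + 1)
    for i in range(1, len(t) + 1):
        cur = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            if t[i - 1] == s[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Because the decompressed length N can be exponential in n, this baseline can take time exponential in the size of the input SLPs.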

16 Our Result 2: Palindromes
Problem: Given an SLP T, compute (a compressed representation of) the set of all palindromes of T.
– n: the size of SLP T
– N: the length of the original text T (note that N = O(2^n))
Previous best result: O(n^5 log^4 N) time [Gasieniec et al. 1996]
Theorem: The problem can be solved in O(n^4) time using O(n^2) space.

17 Details of our algorithm
– Computing longest common substring
– Computing palindromes (omitted in this talk)

18 Property of common substrings (1/3)
For each common substring Z of strings S and T, there always exist variables X_i = X_l X_r and Y_j = Y_l Y_r such that:
– Z is a common substring of X_i and Y_j
– Z contains an overlap between X_l and Y_r
[Figure: Z inside X_i = X_l X_r and inside Y_j = Y_l Y_r, with the overlap w marked.]

19 Property of common substrings (2/3)
For each common substring Z of strings S and T, there always exists a string w such that:
– w is a substring of Z
– w is an overlap of variables of S and T
[Figure: the overlap w of X_l and Y_r.]

20 Property of common substrings (3/3)
For each common substring Z of strings S and T, there always exists a string w such that:
– Z can be calculated by extending w
[Figure: the overlap w is extended in both directions to obtain the common substring Z.]

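21 Overlaps
For any strings X and Y, OL(X, Y) is the set of the lengths of overlaps of X and Y, where an overlap is a suffix of X that is also a prefix of Y.
[Figure: X and Y drawn with a suffix of X aligned over a prefix of Y.]

22 Overlaps Example
OL("aabaaba", "abaababb") = {1, 3, 6}
[Figure: X_l = aabaaba with Y_r = abaababb shifted to each of the three overlapping positions.]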
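The overlap example can be reproduced with a naive sketch on uncompressed strings. The function name `overlaps` is an assumption. It reads an overlap as a non-empty suffix of the first string that equals a prefix of the second, which is the reading that matches the slide's example.

```python
def overlaps(x, y):
    """Set of lengths k >= 1 such that the last k characters of x equal the first k of y."""
    return {
        k
        for k in range(1, min(len(x), len(y)) + 1)
        if x[len(x) - k:] == y[:k]
    }
```

With the strings from the example, `overlaps("aabaaba", "abaababb")` returns `{1, 3, 6}`.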

23 Computing Overlaps [Karpinski et al. 1996]
Lemma: For any variables X_i and X_j of SLP T, OL(X_i, X_j) can be represented by O(n) arithmetic progressions.
Theorem: For any SLP T, OL(X_i, X_j) for all pairs of variables can be computed in a total of O(n^4 log n) time and O(n^3) space.
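To make the phrase "represented by arithmetic progressions" concrete, the sketch below packs an explicit set of overlap lengths into (first, step, count) triples. It only illustrates the data format: the name `to_progressions` and the greedy packing are assumptions, and this is not how Karpinski et al. compute the progressions, since they never enumerate the set.

```python
def to_progressions(values):
    """Greedily pack integers into (first, step, count) arithmetic progressions."""
    xs = sorted(values)
    out = []
    i = 0
    while i < len(xs):
        if i + 1 == len(xs):
            out.append((xs[i], 0, 1))  # a single leftover element
            break
        step = xs[i + 1] - xs[i]
        j = i + 1
        # Extend the run while consecutive differences stay equal to step.
        while j + 1 < len(xs) and xs[j + 1] - xs[j] == step:
            j += 1
        out.append((xs[i], step, j - i + 1))
        i = j + 1
    return out
```

For instance, the overlap set {1, 2, 3, 4} of "aaaa" with itself becomes the single triple (1, 1, 4), however long the run of a's is.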

24-30 How to extend overlaps
Example: aba ∈ OL(X_l, Y_r), where X_i = X_l X_r and Y_j = Y_l Y_r.
[Figure, built up over slides 24-30: X_i and Y_j are aligned on the overlap aba, and the characters next to the overlap are compared one at a time. The successive steps are labelled match (slides 25-27) and mismatch (slides 28-29).]
We are not allowed to process character by character.
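The character-by-character extension shown in the animation can be sketched on uncompressed strings as below. This is exactly the naive procedure that the last slide rules out for compressed input, shown only to make the extension step concrete. The name `extend_overlap` and its argument layout are assumptions.

```python
def extend_overlap(x_l, x_r, y_l, y_r, k):
    """Length of the common substring of x_l+x_r and y_l+y_r grown from an overlap.

    Assumes k is in OL(x_l, y_r): the last k characters of x_l equal
    the first k characters of y_r.
    """
    assert x_l[len(x_l) - k:] == y_r[:k]
    # Extend to the right: x_r against the part of y_r after the overlap.
    right = 0
    while (right < len(x_r) and k + right < len(y_r)
           and x_r[right] == y_r[k + right]):
        right += 1
    # Extend to the left: the part of x_l before the overlap against the end of y_l.
    left = 0
    while (left < len(x_l) - k and left < len(y_l)
           and x_l[len(x_l) - k - 1 - left] == y_l[len(y_l) - 1 - left]):
        left += 1
    return left + k + right
```

For example, with x_l = "xxaab", x_r = "cdz", y_l = "qa", y_r = "abcdw" and k = 2, the overlap "ab" grows to the common substring "aabcd" of length 5.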

31 First-mismatch function [Karpinski et al. 1996]
Input: SLP variables X_i and Y_j, integer k
Output: position of the first mismatch
[Figure: X_i aligned against Y_j using k, with the first mismatch position p marked.]
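A naive, uncompressed stand-in for the first-mismatch function might look like the sketch below. The exact alignment convention is not recoverable from the slide, so the one used here is an assumption: x is compared against y starting at offset k of y, positions are 0-based, and `None` means that no mismatch occurs in the aligned part. The name `first_mismatch` is also hypothetical.

```python
def first_mismatch(x, y, k):
    """Smallest p with x[p] != y[k + p], or None if the aligned parts agree."""
    for p in range(min(len(x), len(y) - k)):
        if x[p] != y[k + p]:
            return p
    return None
```

One call replaces a whole run of single-character comparisons. The compressed version in the talk answers the same kind of query on SLP variables without expanding them.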

32 First-mismatch function [Karpinski et al. 1996]
Lemma: Provided that the sets of overlaps are already computed, FM(X_i, Y_j, k) can be computed in O(n log n) time.

33 Extending overlaps using the FM function
Lemma: Extending overlaps can be done by O(n) calls of the FM function.

34 Computing longest common substring
[Pseudo-code figure with annotations: O(n^2) items; O(n) calls of the FM function for each item; O(n log n) time for each call.]
In total, LCStr(S, T) can be computed in O(n^2 × n × n log n) = O(n^4 log n) time.

35 Conclusions
Computing longest common substring from compressed strings
– O(n^4 log n) time and O(n^3) space
Computing all palindromes from compressed strings
– O(n^4) time and O(n^2) space

36 Thank you for your attention.

