Computing smallest and largest repetition factorization in O(n log n) time Hiroe Inoue, Yoshiaki Matsuoka, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai,


1 Computing smallest and largest repetition factorization in O(n log n) time. Hiroe Inoue, Yoshiaki Matsuoka, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda (Kyushu University). PSC 2016

2 Factorization of string. A factorization of a string w is a sequence f_1, …, f_m of non-empty substrings of w such that w = f_1 … f_m. Each substring in a factorization is called a factor. Several factorizations with specific properties are known: Lempel-Ziv factorizations [Ziv & Lempel, 1977], LZW factorizations [Welch, 1984], and Lyndon factorizations [Chen et al., 1958].

3 Factorization of string (continued). In this work, we consider factorizations in which every factor is a repetition.

4 Repetitive structures. An integer p ≥ 1 is said to be a period of a string w if w[i] = w[i+p] for all 1 ≤ i ≤ |w|−p. [Figure: the example strings abaaba, marked with periods 3 and 6, and abcdabcdabc, marked with periods 4, 8 and 11.]

5 Repetitive structures. w is a square if |w|/2 is a period of w. [Figure: a square highlighted in the example strings.]

6 Repetitive structures. w is a repetition if the smallest period of w is at most |w|/2. [Figure: a repetition highlighted in the example strings.]
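The three definitions above can be checked directly on short strings. The following is a minimal brute-force sketch, not part of the talk; the function names are hypothetical, and it uses 0-indexed positions where the slides use 1-indexed ones.

```python
def periods(w):
    """All periods p >= 1 of w: w[i] == w[i + p] for every valid i (0-indexed)."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def is_square(w):
    """w is a square if |w|/2 is a period, i.e. w = uu for some non-empty u."""
    half, odd = divmod(len(w), 2)
    return len(w) > 0 and not odd and w[:half] == w[half:]


def is_repetition(w):
    """w is a repetition if its smallest period is at most |w|/2."""
    return len(w) > 0 and 2 * periods(w)[0] <= len(w)


print(periods("abcdabcdabc"))  # [4, 8, 11]
print(is_square("abaaba"), is_repetition("ababa"), is_square("ababa"))  # True True False
```

Note that ababa is a repetition (smallest period 2, length 5) without being a square, which is why repetition factorizations are more permissive than square factorizations.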

7 Square factorization. A square factorization is a factorization in which each factor is a square. The size of a factorization is the number of factors in it. [Figure: one example string with a square factorization and one for which no square factorization exists.]

8 Repetition factorization. A repetition factorization is a factorization in which each factor is a repetition. The size of a factorization is the number of factors in it. [Figure: one example string with a repetition factorization and one for which no repetition factorization exists.]

9 Repetition factorization. There can be multiple factorizations of the same string. [Figure: the string abaababaabaaaaa with two repetition factorizations, one of size 3 and one of size 2.]
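As a reference point for the definitions above, the smallest and largest sizes can be computed by a naive dynamic program over all cut positions. This is only a polynomial-time baseline sketched for illustration, not the O(n log n) algorithm of the talk; `factorization_sizes` is a hypothetical name.

```python
def is_repetition(s):
    """True if s has some period p with 2p <= |s| (brute force)."""
    n = len(s)
    return any(all(s[i] == s[i + p] for i in range(n - p))
               for p in range(1, n // 2 + 1))


def factorization_sizes(w):
    """Return (smallest, largest) repetition factorization size, or None if none exists."""
    n = len(w)
    inf = float("inf")
    smallest = [inf] * (n + 1)   # smallest[j]: fewest factors covering w[:j]
    largest = [-inf] * (n + 1)   # largest[j]: most factors covering w[:j]
    smallest[0] = largest[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            # w[:i] must itself be factorizable and w[i:j] must be a repetition
            if smallest[i] < inf and is_repetition(w[i:j]):
                smallest[j] = min(smallest[j], smallest[i] + 1)
                largest[j] = max(largest[j], largest[i] + 1)
    return None if smallest[n] == inf else (smallest[n], largest[n])


print(factorization_sizes("ababaabababbb"))  # (2, 4)
print(factorization_sizes("abc"))            # None
```

The string ababaabababbb is the running example used later in the talk; its smallest factorization is ababaababa, bbb and one largest factorization is abab, aa, baba, bbb.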

10 Related work (n is the length of the input string).

                            Square                             Repetition
    Any factorization       O(n) time, O(n) space [2]          O(n) time, O(n) space [1]
    Smallest factorization  O(n log n) time, O(n) space [2]    -
    Largest factorization   O(n log n) time, O(n) space [2]    -

    [1] Dumitran et al., 2015  [2] Matsuoka et al., 2016

11 Related work and our contribution (n is the length of the input string).

                            Square                             Repetition
    Any factorization       O(n) time, O(n) space [2]          O(n) time, O(n) space [1]
    Smallest factorization  O(n log n) time, O(n) space [2]    O(n log n) time, O(n) space (our contribution)
    Largest factorization   O(n log n) time, O(n) space [2]    O(n log n) time, O(n) space (our contribution)

    [1] Dumitran et al., 2015  [2] Matsuoka et al., 2016

12 Runs ( = maximal repetitions). A run in a string is a maximal repetition of the string, i.e., its periodicity extends neither to the left nor to the right. [Figure: the runs of the example string w = ababaabababbb, with positions 1 to 13.]

13-14 Runs ( = maximal repetitions). [Figure: a run of w is shown together with the characters just outside it; the periodicity cannot be extended because those characters do not match (marked with ≠).]

15 Runs ( = maximal repetitions). We denote each run r by a triple (beg, end, p), where beg is the beginning position, end is the ending position, and p is the smallest period. [Figure: the run (1,10,5) of w is highlighted.]

16 Runs ( = maximal repetitions). Let Runs(w) denote the set of all runs of string w. Lemma 1 [Kolpakov & Kucherov, 1999]: For any string w of length n, |Runs(w)| = O(n). Lemma 2 [Crochemore & Ilie, 2008]: Runs(w) can be computed in O(n) time for an integer alphabet.
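To make the triple notation concrete, here is a naive run finder. It is a sketch for small inputs only and is not the linear-time algorithm cited in Lemma 2; the function name `runs` and the quadratic scan over all candidate periods are assumptions of this illustration.

```python
def runs(w):
    """Return all runs of w as 1-indexed triples (beg, end, p), sorted by beg.

    For every candidate period p, scan for maximal stretches where
    w[k] == w[k + p]. A stretch of length >= p gives a maximal substring of
    length >= 2p with period p. Periods are tried in increasing order, so an
    interval is recorded with its smallest period only.
    """
    n, seen = len(w), {}
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            if j - i >= p:  # substring w[i : j + p] has period p and length >= 2p
                seen.setdefault((i + 1, j + p), p)
            i = j + 1
    return sorted((beg, end, p) for (beg, end), p in seen.items())


for run in runs("ababaabababbb"):
    print(run)
# (1, 5, 2), (1, 10, 5), (3, 8, 3), (5, 6, 1), (6, 11, 2), (11, 13, 1)
```

On the example string this finds six runs, including the run (1,10,5) shown on the slides.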

17 Main idea. Any repetition is a substring of a run and has length at least twice the period of that run. A repetition factorization exists if and only if we can "traverse" such substrings from the beginning of the string to the end. [Figure: the runs of w.]

18 Main idea. We can formalize this idea and define a graph which we call the repetition graph. The problem then reduces to a (weighted) path problem on the repetition graph.

19 Repetition Graph. [Figure: an example of a repetition graph for w = ababaabababbb, drawn over positions 0 to 13.] The graph consists of two types of nodes (white and black) and several types of edges (black, red, and blue).

20 Repetition Graph. We consider positions between characters. One white node is defined for each position.

21-22 Repetition Graph. For each run r = (beg, end, p), add a path for each square of period p within r, and connect black nodes of the same run with blue edges. [Figure: a run and one of its squares.]

23-24 Repetition Graph. By construction, there is a one-to-one correspondence between non-empty paths that start and end at white nodes in the run, and subrepetitions of the run that start and end at those positions.

25 Repetition Graph. [Figure: the complete repetition graph for the example string, with runs r1 to r6.]

26 Size of Repetition Graph. There are n + 1 white nodes. For each square in a run whose period equals the period of the run (a "primitively rooted" square), there is 1 black node and at most 3 edges (black, red, blue). The number of primitively rooted squares is O(n log n) [Crochemore and Rytter, 1995], so the size of the graph is O(n log n).

27 Reduction to Weighted Path Problem. Each path corresponding to a repetition contains exactly 1 black edge. [Figure: the repetition graph of w with runs r1 to r6.]

28-30 Reduction to Weighted Path Problem. There is a one-to-one correspondence between paths from the first white node to the last white node, and repetition factorizations of w. The size of the factorization equals the total number of black edges in the path.

31 Reduction to Weighted Path Problem. If we can solve the weighted path problem on the repetition graph, where each black edge has weight 1 and every other edge has weight 0, then we obtain a smallest/largest repetition factorization.

32 Reduction to Weighted Path Problem. [Figure: a path through r1, r4, r5 and r6 with total weight 4, corresponding to a repetition factorization of size 4.]

33 Reduction to the path problem. Since repetition graphs are DAGs, the smallest/largest weighted path problem can be solved by dynamic programming in time linear in the size of the graph.

34 Reduction to the path problem. For each node, the value of the node can be computed from the smallest/largest value of its incoming nodes. [Figure: the repetition graph of w with the node values computed so far.]

35 Reduction to the path problem. Suppose the calculation up to position 9 has been completed, and we want to find the size of the factorization up to position 10.

36 Reduction to the path problem. To compute the value of the white node at position 10, we first compute the values of all black nodes at position 10.

37 Reduction to the path problem. The value of the lower black node at position 10 is 3 = (the value of the white node at position 6) + 1.

38 Reduction to the path problem. The value of the upper black node at position 10 is 1 = (the value of the white node at position 0) + 1.

39 Reduction to the path problem. The value of the white node is the largest value among its incoming nodes, which is 3 here.

40 Reduction to the path problem. Thus, we obtain the following lemma. Lemma 4: Given the repetition graph of a string of length n, a smallest/largest repetition factorization of the string can be computed in O(n log n) time.
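The graph construction and the dynamic program described on the preceding slides can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: runs are found by a naive quadratic scan instead of a linear-time algorithm, nodes are processed in sorted order (which adds a logarithmic factor), the node encoding and all function names are hypothetical, and unreachable nodes are marked with infinity, whereas the slides display 0 at positions such as 7.

```python
from math import inf


def runs(w):
    """Naive run finder (not linear time); returns 1-indexed (beg, end, p) triples."""
    n, seen = len(w), {}
    for p in range(1, n // 2 + 1):          # increasing p, so the smallest period wins
        i = 0
        while i + p < n:
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            if j - i >= p:                  # maximal stretch of length >= 2p
                seen.setdefault((i + 1, j + p), p)
            i = j + 1
    return sorted((beg, end, p) for (beg, end), p in seen.items())


def build_repetition_graph(w):
    """Map each node to its outgoing (successor, weight) edges.

    White nodes are ("W", j); the black node of run r at position j is ("B", r, j).
    """
    graph = {("W", j): [] for j in range(len(w) + 1)}
    for r, (beg, end, p) in enumerate(runs(w)):
        prev = None
        for j in range(beg - 1 + 2 * p, end + 1):       # j = end of a square of period p
            black = ("B", r, j)
            graph[black] = [(("W", j), 0)]               # red edge, weight 0
            graph[("W", j - 2 * p)].append((black, 1))   # black edge, weight 1
            if prev is not None:
                graph[prev].append((black, 0))           # blue edge inside the run
            prev = black
    return graph


def repetition_factorization(w, largest=True):
    """Return a largest (or smallest) repetition factorization as a list, or None."""
    graph = build_repetition_graph(w)
    unreachable = -inf if largest else inf
    value = {v: unreachable for v in graph}
    parent = {}
    value[("W", 0)] = 0
    # Topological order: by position, black nodes before the white node.
    for u in sorted(graph, key=lambda v: (v[-1], v[0] == "W")):
        if value[u] == unreachable:
            continue
        for v, weight in graph[u]:
            cand = value[u] + weight
            if (cand > value[v]) if largest else (cand < value[v]):
                value[v], parent[v] = cand, u
    node = ("W", len(w))
    if value[node] == unreachable:
        return None
    cuts = []                                # white nodes on the path are the cut positions
    while node is not None:
        if node[0] == "W":
            cuts.append(node[1])
        node = parent.get(node)
    cuts.reverse()
    return [w[a:b] for a, b in zip(cuts, cuts[1:])]


print(repetition_factorization("ababaabababbb", largest=False))  # ['ababaababa', 'bbb']
print(len(repetition_factorization("ababaabababbb", largest=True)))  # 4
```

For the example string the graph has 14 white nodes and 10 black nodes, one per primitively rooted square.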

41 Reducing space requirement. By definition, the repetition graph can be constructed in O(n log n) time, which gives an O(n log n)-time and O(n log n)-space solution. We can simulate our algorithm without constructing the repetition graph explicitly, which gives an O(n log n)-time and O(n)-space algorithm.

42 Solution without Repetition Graph. Suppose the calculation up to position 9 has been completed, and we want to find the size of the factorization up to position 10. [Figure: the repetition graph of w with runs r1 to r6 and the node values computed so far.]

43 Solution without Repetition Graph. Runs r2 and r5 contain squares that end at position 10, starting at positions 0 and 6, respectively.

44 Solution without Repetition Graph. We also know from r5 that there is a previous black node at position 9.

45 Solution without Repetition Graph. The maximum path weight of the new black node for r5 is the maximum of the path weight of the previous black node for r5 and the path weight of the white node at position 6 plus 1.

46 Solution without Repetition Graph. The idea is that we no longer need the value of the black node at position 9 after computing the adjacent black node.

47 Solution without Repetition Graph. Similarly, the maximum path weight of the new black node for r2 is the path weight of the white node at position 0 plus 1.

48 Solution without Repetition Graph. Then we can compute the value of the white node at position 10, which is 3.

49 Solution without Repetition Graph. Since we only need to remember all white nodes and 1 black node per run, and since edges can be computed on the fly from the information of the runs, it is possible to determine the size of the factorization in linear space.

50 Our main result. Theorem: We can compute a smallest/largest repetition factorization of a given string w of length n in O(n log n) time and O(n) space.

51 A complete example. Compute all runs in w, and sort them by their beginning positions. [Figure on each slide of this example: the string w = ababaabababbb over positions 0 to 13, its runs r1 to r6, and the node values computed so far.]

52 A complete example. Consider n+1 white nodes, one for each position.

53 A complete example. For each position, if there is a square that ends at the position, then we compute the values of the black node and the white node.

54 A complete example. At position 0: Since no square ends at position 0, we do not add any node or edge.

55 A complete example. At position 0: Since there is no black node at position 0, the value of the white node at position 0 is 0.

56-58 A complete example. At positions 1-3: Similar to position 0.

59 A complete example. At position 4: Since there is a square that ends at position 4, we know there is a black node.

60 A complete example. At position 4: The value of the black node is the value of the white node at position 0 plus 1.

61 A complete example. At position 4: Then the value of the white node at position 4 is 1.

62 A complete example. At position 5: There is a square that ends at position 5. We can compute the values of the black node and the white node.

63 A complete example. At position 5: We no longer need the value corresponding to the black node at position 4.

64 A complete example. At position 6: There is a square that ends at position 6. We can compute the values of the black node and the white node.

65 A complete example. At position 7: Since no square ends at position 7, the value of the white node at position 7 is 0.

66 A complete example. At position 8: There is a square that ends at position 8. We can compute the values of the black node and the white node.

67 A complete example. At position 9: There is a square that ends at position 9. We can compute the values of the black node and the white node.

68 A complete example. At position 10: There are squares that end at position 10. We can compute the values of the black nodes and the white node.

69 A complete example. At position 11: There is a square that ends at position 11. We can compute the values of the black node and the white node.

70 A complete example. At position 12: There is a square that ends at position 12. We can compute the values of the black node and the white node.

71 A complete example. At position 13: There is a square that ends at position 13. We can compute the values of the black node and the white node.

72 Lower bound of repetition graph. There are strings for which the size of the repetition graph is Θ(n log n), namely Fibonacci strings. Lemma 6 [Fraenkel and Simpson, 1999]: For any Fibonacci string of length n, the number of primitively rooted squares is Θ(n log n).
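The growth stated in Lemma 6 can be observed empirically on small Fibonacci strings. The sketch below counts occurrences of primitively rooted squares by brute force; the indexing convention F_1 = b, F_2 = a, F_k = F_{k-1} F_{k-2} is an assumption (conventions differ between papers), and the function names are hypothetical.

```python
from math import log2


def fibonacci_string(k):
    """F_1 = "b", F_2 = "a", F_k = F_{k-1} F_{k-2} (one common convention)."""
    prev, cur = "b", "a"
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return prev if k == 1 else cur


def count_primitively_rooted_squares(w):
    """Count occurrences of squares uu in w whose root u is primitive (brute force)."""
    n, count = len(w), 0
    for p in range(1, n // 2 + 1):
        for i in range(n - 2 * p + 1):
            u = w[i:i + p]
            # u is primitive iff it occurs in uu only at offsets 0 and |u|
            if u == w[i + p:i + 2 * p] and (u + u).find(u, 1) == p:
                count += 1
    return count


for k in range(6, 15):
    f = fibonacci_string(k)
    squares = count_primitively_rooted_squares(f)
    # the ratio to n log n is printed only to eyeball the growth rate
    print(k, len(f), squares, round(squares / (len(f) * log2(len(f))), 3))
```

Each counted occurrence corresponds to one black node of the repetition graph, so this count is also the number of black nodes the explicit graph would contain for that string.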

73 Conclusions and open question. We showed two algorithms to compute the smallest/largest repetition factorization of a given string of length n. With the graph: O(n log n) time and O(n log n) space. Without the graph: O(n log n) time and O(n) space. The open question is whether there exists an efficient algorithm which computes repetition factorizations of smallest/largest size without relying on the graph.

