Finding Characteristic Substrings from Compressed Texts Shunsuke Inenaga Kyushu University, Japan Hideo Bannai Kyushu University, Japan.


Text Mining and Text Compression

Text mining is the task of finding rules and/or knowledge in given textual data. Text compression reduces the space needed to store given textual data by removing redundancy.

Our Contribution

We present efficient algorithms that find characteristic substrings (patterns) directly from given compressed strings, i.e., without decompression:
- Longest repeating substring (LRS)
- Longest non-overlapping repeating substring (LNRS)
- Most frequent substring (MFS)
- Most frequent non-overlapping substring (MFNS)
- Left and right contexts of a given pattern

Text Compression by Straight Line Program

SLP_T: X_1 = a, X_2 = b, X_3 = X_1 X_2, X_4 = X_3 X_1, X_5 = X_3 X_4, X_6 = X_5 X_5, X_7 = X_4 X_6, X_8 = X_7 X_5

SLP_T is a CFG in Chomsky normal form that generates the language {T}.
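As an illustration (ours, not part of the original slides), the example SLP above can be written down and decompressed in a few lines of Python; the `rules` dictionary and variable names are an ad-hoc encoding chosen here.

```python
from functools import lru_cache

# The example SLP from the slide: each variable is a terminal character
# or the concatenation of two earlier variables (Chomsky normal form).
rules = {
    1: "a",      # X_1 = a
    2: "b",      # X_2 = b
    3: (1, 2),   # X_3 = X_1 X_2
    4: (3, 1),   # X_4 = X_3 X_1
    5: (3, 4),   # X_5 = X_3 X_4
    6: (5, 5),   # X_6 = X_5 X_5
    7: (4, 6),   # X_7 = X_4 X_6
    8: (7, 5),   # X_8 = X_7 X_5
}

@lru_cache(maxsize=None)
def expand(i):
    """Decompress variable X_i into the string it derives."""
    r = rules[i]
    return r if isinstance(r, str) else expand(r[0]) + expand(r[1])

print(expand(8))  # the text T generated by this SLP: abaababaababaababa
```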

Text Compression by Straight Line Program

Encodings of the LZ family, run-length encoding, Sequitur, etc. can quickly be transformed into SLPs.

Exponential Compression by SLP

Highly repetitive texts can be exponentially long w.r.t. the corresponding SLP-compressed texts.

Text T = ababab...ab (T is an N-fold repetition of ab)
SLP_T: X_1 = a, X_2 = b, X_3 = X_1 X_2, X_4 = X_3 X_3, X_5 = X_4 X_4, ..., X_n = X_n-1 X_n-1, so N = O(2^n).

Hence any algorithm that decompresses a given SLP-compressed text can take exponential time. We present efficient (i.e., polynomial-time) algorithms that work without decompression.
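A quick sketch (ours) of why working bottom-up on the grammar escapes this blow-up: the derived length of every variable of the doubling SLP above is computable in O(n) time, even though the text itself has length 2^(n-2).

```python
def derived_lengths(rules):
    """Length of the string derived by each variable, without decompressing."""
    lengths = {}
    for i in sorted(rules):  # variables are numbered bottom-up
        r = rules[i]
        lengths[i] = 1 if isinstance(r, str) else lengths[r[0]] + lengths[r[1]]
    return lengths

# The doubling SLP from the slide: X_3 = ab, and X_i = X_{i-1} X_{i-1} after that.
n = 50
rules = {1: "a", 2: "b", 3: (1, 2)}
for i in range(4, n + 1):
    rules[i] = (i - 1, i - 1)

print(derived_lengths(rules)[n])  # 2**(n-2), far too long to ever write out
```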

Finding Longest Repeating Substring

Input: SLP_T which generates text T
Output: a longest repeating substring (LRS) of T

Example: for T = aabaabcabaabb, the LRS is abaab (it occurs twice).

Key Observation: 6 Cases of Occurrences of LRS

For each variable X_i = X_l X_r, the two occurrences of a repeating substring fall into one of 6 cases, depending on whether each occurrence lies inside X_l, lies inside X_r, or crosses the boundary between them (Cases 1-6 in the figure).

Algorithm to Compute LRS

Input: SLP_T
Output: LRS of text T

foreach variable X_i of SLP_T do
  compute LRS of Case 1;
  compute LRS of Case 2;
  compute LRS of Case 3;
  compute LRS of Case 4;
  compute LRS of Case 5;
  compute LRS of Case 6;
return the two positions and the length of the longest LRS found;

Case 3: one occurrence lies in X_l and the other in X_r. The LRS of X_i in Case 3 is the longest common substring of X_l and X_r.

Longest Common Substring of Two SLPs

Theorem 1 [Matsubara et al. 2009]: For every pair of variables X_l and X_r, we can compute a longest common substring of X_l and X_r in O(n^4 log n) total time, where n is the number of variables in SLP_T.

Case 4: an occurrence of the repeating substring crosses the boundary of a descendant X_j = X_s X_t of X_i. In Case 4-1 we take each overlap of X_l and X_t and expand it; in Case 4-2 we take each overlap of X_r and X_s and expand it (figures).
Set of Overlaps

OL(X, Y) denotes the set of lengths of overlaps of X and Y, i.e., the lengths l such that the length-l suffix of X equals the length-l prefix of Y.

Set of Overlaps

Example: OL(aabaaba, abaababb) = {1, 3, 6}.
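To make the definition concrete, here is a naive quadratic check of OL on uncompressed strings (our illustration; the talk's algorithms never materialize X and Y, and represent OL compactly as arithmetic progressions):

```python
def overlaps(x, y):
    """OL(x, y): lengths l such that the length-l suffix of x
    equals the length-l prefix of y."""
    return {l for l in range(1, min(len(x), len(y)) + 1) if x[-l:] == y[:l]}

print(sorted(overlaps("aabaaba", "abaababb")))  # [1, 3, 6]
```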

Set of Overlaps

Lemma 1 [Karpinski et al. 1997]: For every pair of variables X_i and Y_j, OL(X_i, Y_j) forms O(n) arithmetic progressions.

Lemma 2 [Karpinski et al. 1997]: For every pair of variables X_i and Y_j, OL(X_i, Y_j) can be computed in O(n^4 log n) total time, where n is the number of variables in SLP_T.

Case 4

Lemma 3: For every variable X_i, a longest repeating substring in Case 4 can be computed in O(n^3 log n) time.

[Sketch of proof] We can expand all elements of each arithmetic progression of OL(X_i, X_j) in O(n log n) time. The size of OL(X_i, X_j) is O(n) by Lemma 1, and there are at most n-1 descendants X_j of X_i.

Case 5: symmetric to Case 4.

Case 6: handled similarly to Case 4.

Finding Longest Repeating Substring

Theorem 2: For any SLP_T which generates text T, we can compute an LRS of T in O(n^4 log n) time, where n is the number of variables in SLP_T.

Finding Longest Non-Overlapping Repeating Substring

Input: SLP_T which generates text T
Output: a longest non-overlapping repeating substring (LNRS) of T

Example: for T = ababababab, the LRS is abababab (its two occurrences overlap), while the LNRS is abab.
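The LRS/LNRS distinction can be checked by brute force on tiny uncompressed strings (a sketch of ours, with a hypothetical helper `longest_repeat`; the talk's algorithms of course avoid this exponential-in-the-SLP approach):

```python
def longest_repeat(t, overlap_ok):
    """Longest substring of t occurring at least twice; if overlap_ok is
    False, the two occurrences must not overlap."""
    best = ""
    n = len(t)
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = t[i:j]
            # second occurrence: anywhere after i if overlaps are allowed,
            # otherwise it must start at or after position j = i + len(s)
            k = t.find(s, i + 1 if overlap_ok else j)
            if k != -1 and len(s) > len(best):
                best = s
    return best

t = "ababababab"
print(longest_repeat(t, True))   # abababab  (LRS, occurrences may overlap)
print(longest_repeat(t, False))  # abab      (LNRS, occurrences disjoint)
```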

Finding Longest Non-Overlapping Repeating Substring

Theorem 3: For any SLP_T which generates text T, we can compute an LNRS of T in O(n^6 log n) time, where n is the number of variables in SLP_T.

Finding Most Frequent Substring

Input: SLP_T which generates text T
Output: a most frequent substring (MFS) of T

Without a length constraint the problem is trivial: the solution is always the empty string ε.

Finding Most Frequent Substring

Input: SLP_T which generates text T
Output: a most frequent substring (MFS) of T of length 2

Algorithm to Compute MFS

Input: SLP_T
Output: MFS of text T of length 2

foreach substring P of T of length 2 (there are |Σ|^2 candidates) do
  construct an SLP_P which generates P (e.g., for P = ab: Y_1 = a, Y_2 = b, Y_3 = Y_1 Y_2);
  compute the number of occurrences of P in T;
return the substring with the maximum number of occurrences;

Lemma 4: For every pair of variables X_i and Y_j, the number of occurrences of Y_j in X_i can be computed in O(n^2) total time.

Finding Most Frequent Substring

Theorem 4: For any SLP_T which generates text T, we can compute an MFS of T of length 2 in O(|Σ|^2 n^2) time, where n is the number of variables in SLP_T.

Finding Most Frequent Non-Overlapping Substring

Input: SLP_T which generates text T
Output: a most frequent non-overlapping substring (MFNS) of T of length 2

Example: for T = aaaaababab, the MFS of length 2 is aa, while the MFNS of length 2 is ab.

Finding Most Frequent Non-Overlapping Substring

Theorem 5: For any SLP_T which generates text T, we can compute an MFNS of T of length 2 in O(n^4 log n) time, where n is the number of variables in SLP_T.

Computing Left and Right Contexts of a Given Pattern

Input: two SLPs, SLP_T and SLP_P, which generate text T and pattern P, respectively
Output: a substring αPβ of T such that α (resp. β) always precedes (resp. follows) P in T, with α and β as long as possible

Example: for T = bbaabaabbaabb and P = ab, we get α = ba and β = ε.
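A brute-force version of this definition on uncompressed strings (our illustration; the talk computes the same contexts directly on the SLPs): extend left and right as long as all occurrences of P agree on the next character.

```python
def contexts(t, p):
    """Longest alpha, beta such that every occurrence of p in t
    is preceded by alpha and followed by beta."""
    occ = [i for i in range(len(t) - len(p) + 1) if t.startswith(p, i)]
    alpha, beta = "", ""
    # extend to the left while all occurrences agree on the same character
    k = 1
    while all(i - k >= 0 for i in occ) and len({t[i - k] for i in occ}) == 1:
        alpha = t[occ[0] - k] + alpha
        k += 1
    # extend to the right symmetrically
    e, k = len(p), 0
    while all(i + e + k < len(t) for i in occ) and len({t[i + e + k] for i in occ}) == 1:
        beta += t[occ[0] + e + k]
        k += 1
    return alpha, beta

print(contexts("bbaabaabbaabb", "ab"))  # ('ba', '')
```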

Computing Left and Right Contexts of a Given Pattern

Applications of computing the left and right contexts of patterns include:
- Blog spam detection [Narisawa et al. 2007]
- Computing the maximal extension of most frequent substrings (MFS)

Boundary Lemma [1/2]

Lemma 5 [Miyazaki et al. 1997]: For any SLP variables X = X_l X_r and Y, the occurrences of Y that touch or cover the boundary of X form a single arithmetic progression.

Boundary Lemma [2/2]

Lemma 5 [Miyazaki et al. 1997] (cont.): If the number of elements in the progression is more than 2, then the step of the progression is the smallest period of Y.

Left and Right Contexts

(Figure: occurrences of Y = uu...uv across the boundary of X = X_l X_r.) The left context α of Y in X is a suffix of u. The right context β of Y in X is a prefix of uv[|v| : |uv|].

Computing Left and Right Contexts of a Given Pattern

Theorem 6: For any SLPs SLP_T and SLP_P which generate text T and pattern P, respectively, we can compute the left and right contexts of P in T in O(n^4 log n) time, where n is the number of variables in SLP_T.

Conclusions and Future Work

We presented polynomial-time algorithms to find characteristic substrings of given SLP-compressed texts. Our algorithms are more efficient than any algorithm that works on the uncompressed strings.

Would it be possible to efficiently find other types of substrings in SLP-compressed texts?
- Squares (substrings of the form xx)
- Cubes (substrings of the form xxx)
- Runs (maximal substrings of the form x^k with k ≥ 2)
- Gapped palindromes (substrings of the form x y x^R with |y| ≥ 1)