Computing Left-Right Maximal Generic Words Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan.

Presentation transcript:

Computing Left-Right Maximal Generic Words Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan August 24-26, 2015 PSC2015

Characteristic String of Documents
A string that is common to several documents characterizes that set of documents.
T_1 = praguestringabc, T_2 = bacompscienceap, T_3 = apscreenapscite, T_4 = strconferenceab, T_5 = wepscompresseda

d-Right-Maximal Generic Words
Problem [Kucherov et al., SPIRE 2012]: Let D = {T_1, …, T_m} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-right-maximal extensions of P.
W_D(x): the number of distinct strings in D that have x as a substring.
A string x is a d-right-maximal extension of P if (1) P is a prefix of x, (2) W_D(x) ≥ d, and (3) W_D(xa) < d for any character a.

d-Right-Maximal Generic Words: Example
Problem [Kucherov et al., SPIRE 2012]: Let D = {T_1, …, T_m} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-right-maximal extensions of P.
T_1 = ababaabaaaacb, T_2 = cbaabacabaabc, T_3 = bbabaaca
P = aa, d = 2, output = { aaba, aac }
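The definition can be checked directly on the example above with a brute-force sketch. This is not the data structure of Kucherov et al.; it enumerates substrings and counts documents, so it is only meant for small inputs. The names `doc_count` and `right_maximal_extensions` are illustrative and do not appear in the slides.

```python
def doc_count(docs, x):
    """W_D(x): number of distinct strings in docs that contain x."""
    return sum(1 for t in docs if x in t)


def right_maximal_extensions(docs, p, d):
    """Brute-force d-right-maximal extensions of p (illustrative only)."""
    alphabet = sorted(set("".join(docs)))
    # Candidates: every substring of some document that has p as a prefix.
    candidates = set()
    for t in docs:
        start = t.find(p)
        while start != -1:
            for end in range(start + len(p), len(t) + 1):
                candidates.add(t[start:end])
            start = t.find(p, start + 1)
    # Keep x with W_D(x) >= d and W_D(xa) < d for every character a.
    return {
        x
        for x in candidates
        if doc_count(docs, x) >= d
        and all(doc_count(docs, x + a) < d for a in alphabet)
    }


docs = ["ababaabaaaacb", "cbaabacabaabc", "bbabaaca"]
print(sorted(right_maximal_extensions(docs, "aa", 2)))  # ['aaba', 'aac']
```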

d-Right-Maximal Generic Words
Theorem [Kucherov et al., SPIRE 2012]: There exists an O(n)-space data structure that computes all d-right-maximal extensions of P in O(|P| + rocc) time. The data structure can be constructed in O(n) time.
n: total length of the strings in D
rocc: number of d-right-maximal extensions of P
Each d-right-maximal extension corresponds to a branching node in the generalized suffix tree of D.

Generalized Suffix Tree (GST)
Each leaf of the generalized suffix tree of D corresponds to a suffix of a string in D.
Example: T_1 = aabaab, T_2 = aabab, T_3 = babaaa

Generalized Suffix Tree (GST)
Notation for the generalized suffix tree of D:
GST_D: generalized suffix tree of D
GST_D(u): subtree rooted at node u
str_D(u): string represented by node u in GST_D
weight_D(u) := W_D(str_D(u))
maxchild_D(u): maximum weight among the children of u
L(P): locus of P

d-Right-Maximal Generic Words
Each answer corresponds to a branching node u in GST_D(L(P)) with weight_D(u) ≥ 2 and maxchild_D(u) < 2.
Example: T_1 = aabaab, T_2 = aabab, T_3 = babaaa, P = ab, d = 2, output = { abaa }
[Figure: GST_D for T_1, T_2, T_3 with terminators $_1, $_2, $_3 and the locus L(P) marked.]
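The weight and maxchild test can be tried out on an uncompacted suffix trie, which is a simplification assumed here: the slides use a compacted generalized suffix tree with terminator symbols, while this sketch omits terminators and uses quadratic space. The class `Node` and the function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.docs = set()  # ids of documents with a suffix passing through here


def build_suffix_trie(docs):
    """Insert every suffix of every document into one trie."""
    root = Node()
    for i, t in enumerate(docs):
        for s in range(len(t)):
            node = root
            node.docs.add(i)
            for c in t[s:]:
                node = node.children.setdefault(c, Node())
                node.docs.add(i)
    return root


def weight(node):
    return len(node.docs)


def maxchild(node):
    return max((weight(c) for c in node.children.values()), default=0)


def right_maximal_via_trie(docs, p, d):
    root = build_suffix_trie(docs)
    node = root
    for c in p:  # walk down to the locus of p
        node = node.children.get(c)
        if node is None:
            return set()
    answers, stack = set(), [(node, p)]
    while stack:
        u, s = stack.pop()
        if weight(u) < d:
            continue  # weights never increase downwards, so prune
        if maxchild(u) < d:
            answers.add(s)
        stack.extend((child, s + c) for c, child in u.children.items())
    return answers


print(right_maximal_via_trie(["aabaab", "aabab", "babaaa"], "ab", 2))  # {'abaa'}
```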

New Problem
Problem: Let D = {T_1, …, T_m} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Example: T_1 = ababaabaaaacb, T_2 = cbaabacabaabc, T_3 = bbabaaca, P = aa, d = 2, output = { baaba, abaab, babaa }

Our Contribution
Problem: Let D = {T_1, …, T_m} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Theorem: There exists an O(n log n)-space data structure that computes all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.
n: total length of the strings in D
rocc: number of d-right-maximal extensions of P
occ: number of d-left-right-maximal extensions of P

d-Left-Right-Maximal Generic Words
An answer does not necessarily correspond to a branching node in GST_D(L(P)).
Example: T_1 = aabaab, T_2 = aabab, T_3 = babaaa, P = ab, d = 2, output = { abaa, aaba }
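The slides do not spell out the formal definition of a d-left-right-maximal extension, so the following brute-force sketch rests on an assumed reading: x contains P as a substring, W_D(x) ≥ d, and both W_D(ax) < d and W_D(xa) < d for every character a. Under this reading the answers listed on the slides are reported, but the returned set may contain further strings if the authors' definition is stricter. The function names are illustrative.

```python
def doc_count(docs, x):
    """W_D(x): number of distinct strings in docs that contain x."""
    return sum(1 for t in docs if x in t)


def left_right_maximal_extensions(docs, p, d):
    """Brute force under the assumed definition stated in the lead-in."""
    alphabet = sorted(set("".join(docs)))
    # Candidates: every substring of some document that contains p.
    candidates = {
        t[i:j]
        for t in docs
        for i in range(len(t))
        for j in range(i + len(p), len(t) + 1)
        if p in t[i:j]
    }
    return {
        x
        for x in candidates
        if doc_count(docs, x) >= d
        and all(doc_count(docs, a + x) < d for a in alphabet)  # left-maximal
        and all(doc_count(docs, x + a) < d for a in alphabet)  # right-maximal
    }


result = left_right_maximal_extensions(["aabaab", "aabab", "babaaa"], "ab", 2)
print({"abaa", "aaba"} <= result)
```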

Main Idea
Each d-left-right-maximal extension of P has a right (not necessarily maximal) extension of P as a suffix.
[Figure: an occurrence of P inside T_i.]

Main Idea
If we compute the d-left-maximal extensions of all right extensions of P, we obtain all answers. We consider such extensions on the GST.
[Figure: an occurrence of P inside T_i.]

Main Idea
For each branching right (not necessarily maximal) extension u of P, we compute its d-left-maximal extensions.
D^R = {T_1^R, …, T_m^R}
r(u) = L(str(u)^R)
Nodes v with weight_{D^R}(v) ≥ d and maxchild_{D^R}(v) < d are candidates for answers.
[Figure: GST_D with L(P) and u on one side, GST_{D^R} with r(u) on the other; subtrees are annotated with weights ≥ d and < d.]

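Main Idea
REx: the right extensions u of P in GST_D; cand(u): the candidates below r(u) = L(str(u)^R) in GST_{D^R}.
Cand(REx) = ∪_{u ∈ REx} cand(u): the set of candidates
[Figure: GST_D with L(P), u, and REx on one side, GST_{D^R} with r(u) and cand(u) on the other.]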

Cand(REx)
Cand(REx) may contain non-answers. We want to remove these nodes from Cand(REx), so we characterize them.

Non-answers
The nodes in Cand(REx) that are not answers are not d-right-maximal. We should therefore check whether each candidate is d-right-maximal. To do so, we need the node r'(v) for each node v in GST_{D^R}.
r'(v): the node in GST_D such that str(v)^R = str(r'(v)). (It may be an implicit node.)
[Figure: occurrences of P inside T_i, with non-maximal extensions crossed out.]

Remove non-answers
We check whether node v is d-right-maximal by checking maxchild_D(r'(v)).
d-left-maximal: weight_{D^R}(v) ≥ d and maxchild_{D^R}(v) < d
d-right-maximal: maxchild_D(r'(v)) < d
[Figure: GST_D with L(P), u, and r'(v) on one side, GST_{D^R} with r(u) = L(str(u)^R) and v on the other.]

Remove non-answers
We define the following subset of cand(u), which consists of answers:
cand'(u) = {v | v ∈ cand(u) and maxchild_D(r'(v)) < d}
We compute cand'(u) by using a range reporting query.
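The two-step filter (candidates first, then removal of non-answers) can be mimicked on plain strings instead of suffix tree nodes. The sketch below is a brute-force illustration under that assumption: a "node" is replaced by the string it represents, and none of the function names come from the slides.

```python
def doc_count(docs, x):
    """W_D(x): number of distinct strings in docs that contain x."""
    return sum(1 for t in docs if x in t)


def substrings(docs):
    return {t[i:j] for t in docs for i in range(len(t)) for j in range(i + 1, len(t) + 1)}


def answers_via_candidates(docs, p, d):
    alphabet = sorted(set("".join(docs)))
    subs = substrings(docs)
    # REx: right extensions of p that occur in at least d documents.
    rex = {u for u in subs if u.startswith(p) and doc_count(docs, u) >= d}
    answers = set()
    for u in rex:
        # cand(u): d-left-maximal strings that have u as a suffix.
        cand = {
            v
            for v in subs
            if v.endswith(u)
            and doc_count(docs, v) >= d
            and all(doc_count(docs, a + v) < d for a in alphabet)
        }
        # cand'(u): drop candidates that are not d-right-maximal.
        answers |= {v for v in cand if all(doc_count(docs, v + a) < d for a in alphabet)}
    return answers


found = answers_via_candidates(["aabaab", "aabab", "babaaa"], "ab", 2)
print({"abaa", "aaba"} <= found)
```

The same string can be produced from several right extensions u, which is why the result is collected in a set.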

Computing cand'(u)
preord(v): preorder rank of v in GST_{D'}
end(v): maximum preorder rank in GST_{D'}(v)
The nodes v in cand'(u) satisfy the following:
preord(r(u)) ≤ preord(v) ≤ end(r(u)), weight(v) ≥ d, maxchild(v) < d, maxchild(r'(v)) < d
Equivalently:
preord(r(u)) ≤ preord(v) ≤ end(r(u)), max{maxchild(v), maxchild(r'(v))} < d ≤ weight(v)
We find the nodes that satisfy these conditions by using a segment intersection query.
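The subtree condition preord(r(u)) ≤ preord(v) ≤ end(r(u)) relies on the fact that a subtree occupies a contiguous range of preorder ranks. A minimal sketch, assuming the tree is given as a hypothetical dictionary from a node to its list of children:

```python
def preorder_ranks(tree, root):
    """Return (preord, end): preorder rank and largest rank in each subtree."""
    preord, end = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            end[node] = counter - 1  # last rank handed out inside this subtree
            continue
        preord[node] = counter
        counter += 1
        stack.append((node, True))
        for child in reversed(tree.get(node, [])):
            stack.append((child, False))
    return preord, end


def in_subtree(preord, end, v, r):
    """True iff v lies in the subtree rooted at r."""
    return preord[r] <= preord[v] <= end[r]


tree = {"r": ["a", "b"], "a": ["c"]}  # toy tree, not taken from the slides
preord, end = preorder_ranks(tree, "r")
print(in_subtree(preord, end, "c", "a"), in_subtree(preord, end, "b", "a"))  # True False
```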

Segment Intersection Query
The nodes in GST_{D'} correspond to horizontal segments. The query corresponds to a vertical segment.
preord(r(u)) ≤ preord(v) ≤ end(r(u))
max{maxchild(v), maxchild(r'(v))} < d ≤ weight(v)
[Figure: horizontal segments for the nodes and a vertical query segment at d spanning preord(r(u)) to end(r(u)).]
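The geometric reduction can be illustrated with a linear scan. In this sketch each node v becomes a horizontal segment at height preord(v) that covers the half-open interval from max{maxchild(v), maxchild(r'(v))} (exclusive) to weight(v) (inclusive), and the query is the vertical segment at x = d between preord(r(u)) and end(r(u)). The scan takes O(n) time per query and is only a stand-in for an efficient structure; the `Segment` type and the sample values are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    y: int      # preord(v)
    x_lo: int   # max(maxchild(v), maxchild(r'(v))), exclusive end
    x_hi: int   # weight(v), inclusive end
    label: str  # name of the node v


def stab(segments, d, y_lo, y_hi):
    """Report segments crossed by the vertical segment x = d, y in [y_lo, y_hi]."""
    return [
        s.label
        for s in segments
        if y_lo <= s.y <= y_hi and s.x_lo < d <= s.x_hi
    ]


segments = [
    Segment(1, 0, 3, "v1"),
    Segment(2, 2, 3, "v2"),
    Segment(5, 0, 2, "v3"),
    Segment(3, 1, 1, "v4"),
]
print(stab(segments, d=2, y_lo=1, y_hi=3))  # ['v1']
```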

Computing cand'(u)
The number of horizontal segments is O(n).
Lemma [Chan, 2013]: A segment intersection query can be answered in O(log log n + k) time with an O(n)-space data structure, where n is the number of segments and k is the size of the output.
Lemma: For any node u in REx, cand'(u) can be computed in O(log log n + |cand'(u)|) time with an O(n)-space data structure.

Meaningful Right Extensions
We can obtain the set of answers by computing cand'(u) for every node u in REx. However, there are duplicates, and there are nodes u such that cand'(u) = ∅. We can skip such right extensions by using a range reporting query and a binary search on the GST.
Theorem: There exists an O(n log n)-space data structure that computes all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.

Conclusion
Problem: Let D = {T_1, …, T_m} be a set of strings. Given a pattern P and a positive integer d (≤ m), compute all d-left-right-maximal extensions of P.
Theorem: There exists an O(n log n)-space data structure that computes all d-left-right-maximal extensions of P in O(|P| + occ log^2 n + rocc log n) time.
n: total length of the strings in D
rocc: number of d-right-maximal extensions of P
occ: number of d-left-right-maximal extensions of P

Future Work
Consider a more efficient algorithm.
Can the single-document version, a special case of this problem, be solved more easily?
Consider the minimal discriminating words problem for left-right extensions.
Thank You!

About Cand'(REx)
Cand'(REx) may contain duplicates because of the definition of REx. We want to remove these nodes from Cand'(REx), so we characterize them.

Duplicated Answers
If there exists an answer in which P occurs at least twice, there are duplicated answers.
Lemma: Let u be a node in REx such that P occurs in str(u) at least twice. For any node v such that str(v) is a proper suffix of str(u), cand'(u) ⊆ cand'(v).
[Figure: two occurrences of P inside T_i.]

Checking P's
We use the following lemma.
Lemma: Let u be a node in REx. preord(u1) < beg(L(P)) ≤ end(L(P)) < preord(u2) iff P occurs in str(u) exactly once (as a prefix of str(u)).
k: the index such that SA_{str(u)}[k] = 1
u1: the node with str(u1) = str(u)[SA_{str(u)}[k−1]..|str(u)|]
u2: the node with str(u2) = str(u)[SA_{str(u)}[k+1]..|str(u)|]

Checking P's
[Figure: GST_D with L(P) and u, next to the suffix array SA of str(u) with the positions of str(u1) and str(u2); a second panel shows the case P = str(u2).]


Concrete example (draft)
P = aa, d = 2
T_1 = ababaabaaaacb, T_2 = cbaabacabaabc, T_3 = bbabaaca

d-Right-Maximal Generic Words
Each answer corresponds to a branching node in GST_D. Such nodes exist in GST_D(L(P)), where L(P) is the locus of P.
Example: P = aa, d = 2, T_1 = ababaabaaaacb, T_2 = cbaabacabaabc, T_3 = bbabaaca
[Figure: the part of GST_D below L(P), with edge labels and terminators $_1, $_2, $_3.]

Main Idea
We cannot use the related work [Kucherov et al.] directly.
[Figure: an occurrence of P inside T_i.]

Generic Words
T_1 = praguestringabc, T_2 = bacompscienceap, T_3 = apscreenapscite, T_4 = strconferenceab, T_5 = wepscompresseda
Slide label: generic (or characteristic)

Main Idea
We define the set of candidates Cand(REx) as follows.
REx = {u | u ∈ GST_D(L(P)), weight_D(u) ≥ d} (right extensions of P)
cand(u) = {v | v ∈ GST_{D^R}(r(u)), weight_{D^R}(v) ≥ d, maxchild_{D^R}(v) < d} (d-left-maximal extensions of a right extension of P)
Cand(REx) = ∪_{u ∈ REx} cand(u) (candidates)

Answers
We define the set of answers Cand'(REx) by removing non-answers from Cand(REx).
cand'(u) = {v | v ∈ GST_{D^R}(r(u)), weight_{D^R}(v) ≥ d, maxchild_{D^R}(v) < d, MFC(v) < d} (d-left-maximal extensions of a right extension of P; the condition MFC(v) < d is added)
Cand'(REx) = ∪_{u ∈ REx} cand'(u) (answers, with duplications)

Non-answers
MFC(v): the maximum, over all characters c, of the number of strings in D that have str(v)^R c as a substring.
Example: among T_1, …, T_5, str(v)^R is followed by "a" in 3 distinct strings and by "b" in 2 distinct strings, so MFC(v) = 3.

Non-answers
MFC(v): the maximum, over all characters c, of the number of strings in D that have str(v)^R c as a substring.
Using this information, the following lemma holds.
Lemma: For any node v ∈ cand(u) for some u ∈ REx, MFC(v) ≥ d iff str(v)^R is not an answer.
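The quantity MFC(v) can be computed naively from its definition. The sketch below takes the string s = str(v)^R directly instead of a suffix tree node, and the five sample documents are invented to mirror the counts in the slide example (three strings continue with "a", two with "b"); they are not the documents from the talk.

```python
def mfc(docs, s):
    """Maximum over characters c of the number of documents containing s + c."""
    alphabet = set("".join(docs))
    return max(
        (sum(1 for t in docs if s + c in t) for c in alphabet),
        default=0,
    )


def is_non_answer(docs, s, d):
    """Per the lemma: a candidate with MFC >= d is not an answer."""
    return mfc(docs, s) >= d


docs = ["pxa", "qxa", "rxa", "sxb", "txb"]  # hypothetical documents
print(mfc(docs, "x"))  # 3
```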

Note: extending on both sides seems more natural, so why only the right side?