Computing longest common substring and all palindromes from compressed strings Wataru Matsubara 1, Shunsuke Inenaga 2, Akira Ishino 1, Ayumi Shinohara.


Computing longest common substring and all palindromes from compressed strings
Wataru Matsubara (1), Shunsuke Inenaga (2), Akira Ishino (1), Ayumi Shinohara (1), Tomoyuki Nakamura (1), Kazuo Hashimoto (1)
(1) Graduate School of Information Sciences, Tohoku University, Japan
(2) Department of Computer Science and Communication Engineering, Kyushu University, Japan

Background and motivations

What is a compressed string algorithm?
input text: A palindrome is a symmetric string. It is interesting on its own as a word puzzle. For example, “I prefer pi”, “Borrow or rob?”, and “Was it a bar or a bat I saw?” and so on.
find palindromes
output: mm, isi, zz, ipreferpi, borroworrob, wasitabarorabatisaw, oo, …
compressed text: e)%eARY)(ReJD)OIHOIFEnkkdi we02kfo)J”LPEPJ9wEOW*# eO …
One solution would be to decompress the compressed text and then find palindromes in the decompressed text. However, the decompressed size can be exponentially large with respect to the compressed size.

Goal of algorithms for compressed strings
– Process the compressed text without decompression.
– Processing time should be polynomial in n, where n is the size of the compressed text.
– The decompressed size can be exponentially large with respect to n.

Compression schemes
– run-length encoding
– Lempel-Ziv
– grammar-based compression
Straight Line Program (SLP): the resulting archive of most practical compression methods can be transformed into an SLP generating the same original text [Rytter 2003].

Definition of Straight Line Program (SLP)
An SLP T is a sequence of assignments X_1 = expr_1; X_2 = expr_2; …; X_n = expr_n, where each X_k is a variable and each expr_k is either a single character a (a ∈ Σ) or X_i X_j (i, j < k).
An SLP T for a string w is a CFG in Chomsky normal form s.t. L(T) = {w}.
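
The definition above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `Rule`, `lengths` and `char_at` and the 0-based variable indices are not from the talk. It shows that the length of every variable, and any single character of the text, can be obtained without decompressing.

```python
from typing import List, Tuple, Union

# A rule is either a terminal character or a pair (i, j) of earlier
# variable indices, mirroring X_k = a or X_k = X_i X_j with i, j < k.
Rule = Union[str, Tuple[int, int]]

def lengths(slp: List[Rule]) -> List[int]:
    """Length of the string derived by each variable, in one pass over the rules."""
    out: List[int] = []
    for k, rule in enumerate(slp):
        if isinstance(rule, str):
            out.append(1)
        else:
            i, j = rule
            assert i < k and j < k, "rules may only refer to earlier variables"
            out.append(out[i] + out[j])
    return out

def char_at(slp: List[Rule], lens: List[int], k: int, pos: int) -> str:
    """Character at 0-based position pos of variable k, found by walking down the rules."""
    while not isinstance(slp[k], str):
        i, j = slp[k]
        if pos < lens[i]:
            k = i              # the position falls into the left child
        else:
            pos -= lens[i]     # skip the left child and continue in the right one
            k = j
    return slp[k]

# Hypothetical SLP: X0 = a, X1 = b, X2 = X0 X1, X3 = X2 X2
slp = ["a", "b", (0, 1), (2, 2)]
lens = lengths(slp)
text = "".join(char_at(slp, lens, 3, p) for p in range(lens[3]))
print(lens)   # [1, 1, 2, 4]
print(text)   # abab
```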

Straight Line Program (SLP) example
SLP T:
X_1 = a
X_2 = b
X_3 = X_1 X_2
X_4 = X_3 X_1
X_5 = X_3 X_4
X_6 = X_5 X_5
X_7 = X_4 X_6
X_8 = X_7 X_5
n is the number of assignments and N is the length of the derived string; N = O(2^n).
(figure: derivation tree of X_8, whose children are X_7 and X_5)
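
As a quick check of the example, the eight rules can be expanded directly. The helper names `expand` and `length` are illustrative assumptions. The second part uses a hypothetical "doubling" SLP to show how N can grow exponentially in n.

```python
def expand(rules, k, memo):
    """Decompress variable X_k; memo caches the strings already expanded."""
    if k not in memo:
        rule = rules[k]
        if isinstance(rule, str):
            memo[k] = rule
        else:
            memo[k] = expand(rules, rule[0], memo) + expand(rules, rule[1], memo)
    return memo[k]

def length(rules, k, memo):
    """Length of X_k computed from the rules alone, without building the string."""
    if k not in memo:
        rule = rules[k]
        if isinstance(rule, str):
            memo[k] = 1
        else:
            memo[k] = length(rules, rule[0], memo) + length(rules, rule[1], memo)
    return memo[k]

# The SLP from the slide, with 1-based variable numbers as keys.
rules = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4),
         6: (5, 5), 7: (4, 6), 8: (7, 5)}
print(expand(rules, 8, {}))   # abaababaababaababa

# Hypothetical doubling SLP: X_1 = a, X_k = X_{k-1} X_{k-1}; 30 rules, 2^29 characters.
doubling = {1: "a"}
doubling.update({k: (k - 1, k - 1) for k in range(2, 31)})
print(length(doubling, 30, {}))   # 536870912
```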

Efficient algorithms for compressed strings
substring matching
– Karpinski et al (1996): O(n^4 log n) time
– Miyazaki et al (1997): O(n^4) time
– Lifshits (2006): O(n^3) time
minimum period
– Karpinski et al (1996): O(n^4 log n) time
– Lifshits (2006): O(n^3 log N) time
all squares
– Gasieniec et al (1994): O(n^6 log^5 N) time

Hardness results
Subsequence pattern matching
– Lifshits and Lohrey (2006): NP-hard
Longest common subsequence
– Lifshits and Lohrey (2006): NP-hard
Hamming distance
– Lifshits (2007): #P-complete
Is there any reasonable comparison measure for compressed strings?

String comparison measures
– Hamming distance: O(N) on uncompressed text; #P-complete on compressed text [Lifshits 07]
– Longest common subsequence: O(N^2 / log N) on uncompressed text; NP-hard on compressed text [Lifshits and Lohrey 06]
– Longest common substring: O(N) on uncompressed text; ?? on compressed text (we solve this problem)

Our results

Our Result 1: Longest Common Substring
Problem: Given two SLPs T and S that describe texts T and S respectively, compute LCStr(T, S).
LCStr(T, S): the length of the longest common substring of T and S
n: the total size of the input SLPs
Theorem: LCStr(T, S) can be computed in O(n^4 log n) time using O(n^3) space.
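
To pin down what LCStr means, here is a baseline on uncompressed strings. It is a plain quadratic dynamic program, chosen for brevity; it is neither the O(N) method for uncompressed text nor the compressed algorithm of the talk. The function name `lcstr_length` is an assumption.

```python
def lcstr_length(s: str, t: str) -> int:
    """Length of the longest common substring of s and t (O(|s| * |t|) time)."""
    best = 0
    prev = [0] * (len(t) + 1)   # prev[j]: longest common suffix of s[:i-1] and t[:j]
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the common suffix by one character
                best = max(best, cur[j])
        prev = cur
    return best

print(lcstr_length("aabaaba", "abaababb"))   # 6, the common substring is "abaaba"
```

On SLP inputs this baseline would first need the decompressed texts, which is exactly the exponential blow-up the theorem avoids.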

Our Result 2: Palindromes
Problem: Given an SLP T, compute (a compressed representation of) the set of all palindromes of T.
n: the size of SLP T
N: the length of the original text T (note that N = O(2^n))
Previous best result: O(n^5 log^4 N) time [Gasieniec et al 1996]
Theorem: The problem can be solved in O(n^4) time using O(n^2) space.
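
For comparison, a simple uncompressed baseline is sketched below. It assumes that "all palindromes" can be summarized by the maximal palindrome at each of the 2N - 1 centers, which is a common convention but is not stated on the slide. The function name `maximal_palindromes` is hypothetical, and the expand-around-center approach used here takes O(N^2) time in the worst case.

```python
def maximal_palindromes(s: str):
    """Return (start, end) of the maximal palindrome s[start:end] at every center."""
    result = []
    n = len(s)
    for c in range(2 * n - 1):
        # Even c: center on a character (lo == hi); odd c: center between two characters.
        lo, hi = c // 2, (c + 1) // 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        if hi - lo - 1 > 0:              # skip centers with no palindrome at all
            result.append((lo + 1, hi))
    return result

pals = maximal_palindromes("abaab")
longest = max(pals, key=lambda p: p[1] - p[0])
print("abaab"[longest[0]:longest[1]])    # baab
```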

Details of our algorithm
– Computing longest common substring
– Computing palindromes (omitted in this talk)

Property of common substrings (1/3)
For each common substring Z of strings S and T, there always exist variables X_i = X_l X_r and Y_j = Y_L Y_R such that:
– Z is a common substring of X_i and Y_j
– Z contains an overlap between X_l and Y_R
(figure: Z inside X_i = X_l X_r and Y_j = Y_L Y_R, with the overlap w marked)

Property of common substrings (2/3)
For each common substring Z of strings S and T, there always exists a string w such that:
– w is a substring of Z
– w is an overlap of variables of S and T
(figure: the overlap w between X_l and Y_R, inside X_i = X_l X_r and Y_j = Y_L Y_R)

Property of common substrings (3/3)
For each common substring Z of strings S and T, there always exists a string w such that:
– Z can be calculated by extending w
(figure: the overlap w is extended to the left and to the right to obtain the common substring Z)

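Overlaps
For any strings X and Y, OL(X, Y) is the set of the lengths of overlaps of X and Y, i.e. the lengths k for which the suffix of X of length k equals the prefix of Y of length k.
(figure: a suffix of X overlapping a prefix of Y)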

Overlaps example
OL(“aabaaba”, “abaababb”) = {1, 3, 6}
(figure: the three alignments of X_l and Y_R, one for each overlap)
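
The example can be reproduced with a brute-force check on plain strings. This is a sketch for illustration only: the function name `overlaps` is an assumption, and the compressed algorithm never compares decompressed strings like this.

```python
def overlaps(x: str, y: str) -> set:
    """Lengths k >= 1 such that the last k characters of x equal the first k of y."""
    return {k for k in range(1, min(len(x), len(y)) + 1) if x[-k:] == y[:k]}

print(sorted(overlaps("aabaaba", "abaababb")))   # [1, 3, 6]
```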

Computing overlaps [Karpinski et al 1996]
Lemma: For any variables X_i and X_j of SLP T, OL(X_i, X_j) can be represented by O(n) arithmetic progressions.
Theorem: For any SLP T, OL(X_i, X_j) can be computed for all pairs of variables in a total of O(n^4 log n) time and O(n^3) space.
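
To illustrate what "represented by arithmetic progressions" means, the sketch below groups a set of overlap lengths into triples (first, step, count). The greedy grouping and the names `overlaps` and `to_progressions` are assumptions made for this example; the construction behind the lemma is different and works on the SLP itself.

```python
def overlaps(x: str, y: str) -> set:
    """Brute-force overlap lengths of plain strings x and y."""
    return {k for k in range(1, min(len(x), len(y)) + 1) if x[-k:] == y[:k]}

def to_progressions(values):
    """Greedily split the sorted values into arithmetic progressions (first, step, count)."""
    vals = sorted(values)
    out = []
    i = 0
    while i < len(vals):
        if i + 1 == len(vals):
            out.append((vals[i], 0, 1))      # a single leftover value
            break
        step = vals[i + 1] - vals[i]
        j = i + 1
        while j + 1 < len(vals) and vals[j + 1] - vals[j] == step:
            j += 1                           # keep going while the step stays the same
        out.append((vals[i], step, j - i + 1))
        i = j + 1
    return out

print(to_progressions(overlaps("ababab", "ababab")))      # [(2, 2, 3)]
print(to_progressions(overlaps("aabaaba", "abaababb")))   # [(1, 2, 2), (6, 0, 1)]
```

Periodic strings produce many overlaps, but they fall into few progressions, which is why such a representation can stay small.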

How to extend overlaps
Example: aba ∈ OL(X_l, Y_R).
(figure: X_i = X_l X_r aligned with Y_j = Y_L Y_R at the overlap aba; the characters next to the overlap are compared one by one, giving match, match, match, and then mismatch)
We are not allowed to process character by character.

First-mismatch function [Karpinski et al 1996]
input: SLP variables X_i and Y_j, integer k
output: position of the first mismatch
(figure: Y_j aligned with X_i at position k, with the first mismatch position p marked)

First-mismatch function [Karpinski et al 1996]
Lemma: Provided that the sets of overlaps are already computed, FM(X_i, Y_j, k) can be computed in O(n log n) time.
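
On plain strings, a first-mismatch query is a simple scan. The sketch below assumes a 0-based alignment position k and returns the number of characters that match before the first mismatch; the exact indexing used in the talk may differ. The name `first_mismatch` is hypothetical. This character-by-character loop is precisely what the compressed version must avoid.

```python
def first_mismatch(x: str, y: str, k: int) -> int:
    """Number of matching characters when y is aligned at 0-based position k of x."""
    p = 0
    while k + p < len(x) and p < len(y) and x[k + p] == y[p]:
        p += 1
    return p

print(first_mismatch("ababaabab", "abaab", 0))   # 3: "aba" matches, then b != a
print(first_mismatch("ababaabab", "abaab", 2))   # 5: all of y matches
```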

Extending overlaps using FM function
Lemma: Extending overlaps can be done by O(n) calls of the FM function.
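
The extension step can be sketched on plain strings as follows. Given an overlap of length k between X_l and Y_R, the common substring is grown to the right (into X_r and the rest of Y_R) and to the left (into the rest of X_l and Y_L). The function names and the example strings are hypothetical, and the two scanning loops stand in for what would be FM queries on SLP variables.

```python
def common_prefix(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def extend_overlap(xl: str, xr: str, yl: str, yr: str, k: int) -> int:
    """Length of the common substring of xl+xr and yl+yr obtained by extending
    the overlap w, where w is both the last k characters of xl and the first k of yr."""
    cut = len(xl) - k
    assert 0 <= k <= min(len(xl), len(yr)) and xl[cut:] == yr[:k]
    right = common_prefix(xr, yr[k:])                # grow past the end of w
    left = common_prefix(xl[:cut][::-1], yl[::-1])   # grow before the start of w
    return left + k + right

# X = "xxab" + "cdz", Y = "qx" + "abcdw"; the overlap "ab" extends to "xabcd".
print(extend_overlap("xxab", "cdz", "qx", "abcdw", 2))   # 5
```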

Computing longest common substring
(pseudo-code figure with annotations: O(n^2) items; O(n) calls of the FM function for each item; O(n log n) time for each call)
In total, LCStr(S, T) can be computed in O(n^2 × n × n log n) = O(n^4 log n) time.
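
The overall loop structure can be mimicked on decompressed strings, as a way to see why scanning pairs of variables and extending overlaps is enough. This is only a sketch under assumptions: it expands every variable (which the real algorithm must not do), it also tries the empty overlap and the symmetric case of overlaps between Y_L and X_r, and it assumes every variable is reachable from the last rule. All function names are hypothetical.

```python
from itertools import product

def expand_all(rules):
    """Expand every variable of an SLP given as a list of rules (a character or (i, j))."""
    out = []
    for rule in rules:
        out.append(rule if isinstance(rule, str) else out[rule[0]] + out[rule[1]])
    return out

def overlaps0(x, y):
    """Overlap lengths of x and y, including the empty overlap 0."""
    return [k for k in range(min(len(x), len(y)) + 1) if x[len(x) - k:] == y[:k]]

def common_prefix(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def extend(xl, xr, yl, yr, k):
    """Extend the overlap of length k between xl and yr in both directions."""
    right = common_prefix(xr, yr[k:])
    left = common_prefix(xl[:len(xl) - k][::-1], yl[::-1])
    return left + k + right

def lcstr_via_overlaps(slp_s, slp_t):
    s_vars, t_vars = expand_all(slp_s), expand_all(slp_t)
    # A shared single character covers texts that have no concatenation rule at all.
    best = 1 if set(s_vars[-1]) & set(t_vars[-1]) else 0
    for rs, rt in product(slp_s, slp_t):          # O(n^2) pairs of variables
        if isinstance(rs, str) or isinstance(rt, str):
            continue
        xl, xr = s_vars[rs[0]], s_vars[rs[1]]
        yl, yr = t_vars[rt[0]], t_vars[rt[1]]
        for k in overlaps0(xl, yr):               # overlaps between X_l and Y_R
            best = max(best, extend(xl, xr, yl, yr, k))
        for k in overlaps0(yl, xr):               # symmetric case: Y_L and X_r
            best = max(best, extend(yl, yr, xl, xr, k))
    return best

# S derives "abaababaababaababa"; T is a hypothetical SLP deriving "abbabb".
S = ["a", "b", (0, 1), (2, 0), (2, 3), (4, 4), (3, 5), (6, 4)]
T = ["a", "b", (1, 1), (0, 2), (3, 3)]
print(lcstr_via_overlaps(S, T))   # 3, the common substring is "bab"
```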

Conclusions
– Computing longest common substring from compressed strings: O(n^4 log n) time and O(n^3) space
– Computing all palindromes from compressed strings: O(n^4) time and O(n^2) space

Thank you for your attention.