Efficient algorithms for the scaled indexing problem Biing-Feng Wang, Jyh-Jye Lin, and Shan-Chyun Ku Journal of Algorithms 52 (2004) 82–100 Presenter:


Efficient algorithms for the scaled indexing problem Biing-Feng Wang, Jyh-Jye Lin, and Shan-Chyun Ku Journal of Algorithms 52 (2004) 82–100 Presenter: Yung-Hsing Peng Date:

Example for the problem
Let T = a^5 c^6 a^1 c^1 a^4 b^3 (run-length coding), P1 = a^2 c^1, and P2 = c^1 a^1 b^1.
δ_k is the scaling function with parameter k; for an integer k it multiplies every run length by k.
If k = 2, we have δ_2(P1) = a^4 c^2 and δ_2(P2) = c^2 a^2 b^2.
=> δ_2(P1) can be found in T, so P1 is a valid pattern.
In this example, P2 is not a valid pattern, since δ_k(P2) cannot be found in T for any k.
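The example above can be checked by brute force. The following is a minimal sketch, not the paper's algorithm: it assumes integer scales only, represents a run-length coded string as a list of (character, length) pairs, and tests validity by plain substring search on the expanded strings. The names scale, expand, and is_valid are hypothetical.

```python
# Brute-force check of the example (assumption: integer scales k = 1..max_k).

def scale(runs, k):
    """Apply the scaling function: multiply every run length by k."""
    return [(ch, n * k) for ch, n in runs]

def expand(runs):
    """Turn a run-length coded string back into a plain string."""
    return "".join(ch * n for ch, n in runs)

def is_valid(pattern_runs, text_runs, max_k):
    """True if the scaled pattern occurs in the text for some k in 1..max_k."""
    text = expand(text_runs)
    return any(expand(scale(pattern_runs, k)) in text
               for k in range(1, max_k + 1))

T = [("a", 5), ("c", 6), ("a", 1), ("c", 1), ("a", 4), ("b", 3)]
P1 = [("a", 2), ("c", 1)]
P2 = [("c", 1), ("a", 1), ("b", 1)]

print(is_valid(P1, T, 6))  # P1 matches at k = 2
print(is_valid(P2, T, 6))  # P2 matches at no scale
```

Scales above 6 need not be tried here, because no run of T is longer than 6.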

Algorithm for Discrete Scaling
For every positive integer k, construct a new string T_k from T.
Take a run x^y for example: if y is divisible by k, replace it by x^(y/k); otherwise replace it by x^⌊y/k⌋ $ x^⌊y/k⌋ (which is just $ when ⌊y/k⌋ = 0).
Ex: T = a^5 c^6 a^1 c^1 a^4 b^3 (with max repeat m = 6)
  T_1 = a^5 c^6 a^1 c^1 a^4 b^3
  T_2 = a^2 $ a^2 c^3 $ $ a^2 b^1 $ b^1
  T_3 = a^1 $ a^1 c^2 $ $ a^1 $ a^1 b^1
  T_4 = a^1 $ a^1 c^1 $ c^1 $ $ a^1 $
  T_5 = a^1 c^1 $ c^1 $ $ $ $
  T_6 = $ c^1 $ $ $ $
Theorem: Let P be a valid pattern; then P must be found in T_1 $ T_2 $ T_3 ... $ T_m.
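The replacement rule can be sketched directly. This is an illustrative sketch rather than the paper's implementation: it works on expanded strings instead of run-length codes, uses floor division for the non-divisible case, and the name build_tk is hypothetical.

```python
# Build T_k from the run-length coding of T (assumption: k is a positive integer).

def build_tk(runs, k):
    """Replace x^y by x^(y/k) if k divides y, else by x^q $ x^q with q = y // k."""
    parts = []
    for ch, y in runs:
        q, r = divmod(y, k)
        block = ch * q  # empty string when q == 0
        parts.append(block if r == 0 else block + "$" + block)
    return "".join(parts)

T = [("a", 5), ("c", 6), ("a", 1), ("c", 1), ("a", 4), ("b", 3)]
m = max(y for _, y in T)

# Concatenate T_1 $ T_2 $ ... $ T_m, the string that would be indexed.
index_text = "$".join(build_tk(T, k) for k in range(1, m + 1))

print(build_tk(T, 2))        # aa$aaccc$$aab$b
print("aac" in index_text)   # P1 = a^2 c^1 is found
print("cab" in index_text)   # P2 = c^1 a^1 b^1 is not
```

A plain substring test stands in for the suffix tree query here; it only illustrates the theorem on this one example.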

An Efficient Method to Build T_k
Use T_(k-1) to compute T_k (using the index set I_k). In the example, I_k lists the runs of T whose length is at least k.
Ex: T = a^5 c^6 a^1 c^1 a^4 b^3 (with max repeat m = 6)
  I_1 = {1,2,3,4,5,6}   T_1 = a^5 c^6 a^1 c^1 a^4 b^3
  I_2 = {1,2,5,6}       T_2 = a^2 $ a^2 c^3 $ $ a^2 b^1 $ b^1
  I_3 = {1,2,5,6}       T_3 = a^1 $ a^1 c^2 $ $ a^1 $ a^1 b^1
  I_4 = {1,2,5}         T_4 = a^1 $ a^1 c^1 $ c^1 $ $ a^1 $
  I_5 = {1,2}           T_5 = a^1 c^1 $ c^1 $ $ $ $
  I_6 = {2}             T_6 = $ c^1 $ $ $ $
  I_7 = {}
For any I_k, there are at most n/k elements.
=> |I_1| + |I_2| + |I_3| + ... + |I_m| <= n(1 + 1/2 + ... + 1/m) = O(n log m)
=> T_1 $ T_2 $ T_3 $ ... $ T_m can be built in O(n log m).
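The index sets in the example are consistent with reading I_k as the positions (1-based) of the runs whose length is at least k. Under that assumption, which is inferred from the example rather than stated on the slide, each I_k can be filtered out of I_(k-1). The sketch below uses the hypothetical name index_sets.

```python
# Compute I_1, I_2, ... incrementally (assumption: I_k = runs of length >= k).

def index_sets(run_lengths):
    """Return {k: I_k}, deriving each I_k by filtering I_(k-1)."""
    sets = {}
    current = list(range(1, len(run_lengths) + 1))  # I_1: every run
    k = 1
    while current:
        sets[k] = current
        k += 1
        current = [i for i in current if run_lengths[i - 1] >= k]
    sets[k] = []  # first empty set, I_(m+1)
    return sets

runs = [5, 6, 1, 1, 4, 3]   # run lengths of T = a^5 c^6 a^1 c^1 a^4 b^3
n = sum(runs)               # length of T
I = index_sets(runs)

for k, members in I.items():
    print(k, members)

# Every run in I_k has at least k characters, so |I_k| <= n / k.
print(all(len(members) * k <= n for k, members in I.items()))
```

Filtering the previous set means a run is never examined again once it drops out.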

Time Complexity of Discrete Scaling
Lemma: T_1 $ T_2 $ T_3 ... $ T_m can be built in O(n log m).
Lemma: For each T_k, its length is O(n/k).
=> The length of T_1 $ T_2 $ T_3 ... $ T_m is O(n/1 + n/2 + n/3 + ... + n/m) = O(n log m).
=> The suffix tree of T_1 $ T_2 ... $ T_m can be built in O(n log m),
where n is the length of T and m is the maximum run length of a character in T.
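The step from the sum of the lengths to O(n log m) is the standard bound on the harmonic number H_m, written out here for completeness:

```latex
\sum_{k=1}^{m} \frac{n}{k}
  \;=\; n \sum_{k=1}^{m} \frac{1}{k}
  \;=\; n\,H_m
  \;\le\; n\,(1 + \ln m)
  \;=\; O(n \log m)
```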

Algorithm for the Decision Version of the Real Scaling (1/2)
For every critical real number k, construct a new string T_k from T.
Since the run lengths of the input pattern P are integers
=> we can find all critical k by division.
Ex: a^5 c^6 a^1 c^1 a^4 b^3 =>
  (1) divided by 1 => {5, 6, 1, 4, 3}
  (2) divided by 2 => {2.5, 3, 2, 1.5}
  (3) divided by 3 => {1.67, 2, 1.33, 1}
  (4) divided by 4 => {1.25, 1.5, 1}
  (5) divided by 5 => {1, 1.2}
  (6) divided by 6 => {1}
Only quotients of at least 1 are listed. If m is the maximum run length (m = 6 in this example), then the set Γ(T) of critical k can be computed as the union of (1)~(m).
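The division step can be reproduced with exact fractions. This sketch assumes, as the example suggests, that only quotients of at least 1 are kept and that the divisors run from 1 to the maximum run length of T; the name critical_scales is hypothetical.

```python
# Enumerate the candidate critical scales of T as exact fractions.
from fractions import Fraction

def critical_scales(run_lengths):
    """Union over d = 1..m of { y / d : y is a run length, y / d >= 1 }."""
    m = max(run_lengths)
    return {Fraction(y, d)
            for d in range(1, m + 1)
            for y in set(run_lengths)
            if y >= d}

runs = [5, 6, 1, 1, 4, 3]   # run lengths of T = a^5 c^6 a^1 c^1 a^4 b^3
gamma = critical_scales(runs)

print(sorted(gamma))
print(len(gamma))
```

Using Fraction instead of float keeps 5/3 and 4/3 exact, so equal quotients reached by different divisions (for example 6/4 and 3/2) collapse to one element.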

Algorithm for the Decision Version of the Real Scaling (2/2)
For all critical k in Γ(T), construct a new string T_k from T.
Take a run x^y for example: if y is k-invertible, replace it by x^Φ(y, k); otherwise replace it by x^Φ(y, k) $ x^Φ(y, k),
where Φ(y, k) is the largest integer r such that floor(k*r) ≤ y.
Ex: T = a^5 c^6 a^1 c^1 a^4 b^3 (with max repeat m = 6)
  If k = 1.5, then T_k = a^3 $ a^3 c^4 $ $ a^3 b^2
Theorem: Let P be a valid pattern; then P must be found in T_k1 $ T_k2 $ T_k3 ... $ T_kz, where z is the number of critical k.
In the above example, if k = 1.7 then T_k would be a^3 c^4 $ $ a^2 $ a^2 b^2.
=> The position of δ_1.7(a^3 c^4) in T is different from that of δ_1.5(a^3 c^4) in T.
=> This algorithm can only solve the decision version of real scaling.
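Φ(y, k) follows directly from its definition. The slide does not define k-invertible, so the sketch below assumes it means that some integer r satisfies floor(k*r) = y; it also assumes k ≥ 1 so that the search terminates. The names phi and is_invertible are hypothetical.

```python
# Phi(y, k) and a k-invertibility test (assumptions stated in the text above).
import math
from fractions import Fraction

def phi(y, k):
    """Largest integer r with floor(k * r) <= y, for k >= 1."""
    r = 0
    while math.floor(k * (r + 1)) <= y:
        r += 1
    return r

def is_invertible(y, k):
    """Assumed meaning: some integer r has floor(k * r) == y.

    floor(k * r) never decreases as r grows, so it is enough to test
    the largest r that does not overshoot y.
    """
    return math.floor(k * phi(y, k)) == y

k15 = Fraction(3, 2)
k17 = Fraction(17, 10)
for y in (5, 6, 4, 3):
    print(y, phi(y, k15), is_invertible(y, k15), phi(y, k17), is_invertible(y, k17))
```

For the runs of length 5, 6, 4, and 3 this agrees with the slide: at k = 1.5 the run a^5 is split as a^3 $ a^3, while at k = 1.7 it maps to a single a^3 and a^4 is split as a^2 $ a^2.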

Time Complexity of Decision Version of Real Scaling
Lemma: In the worst case, the total number of critical k is O(n).
Lemma: Each T_ki can be computed in O(n).
Lemma: T_k1 $ T_k2 $ T_k3 ... $ T_kz can be built in O(n^2).

Algorithm for the Real Scaling (1/4)
Core: Generate all valid patterns and use them to build a Real Scale Indexing Tree (RSIT) to speed up searching.

Algorithm for the Real Scaling (2/4)
The upper bound for the number of all valid patterns:
Since there are O(n^3) patterns, straightforward implementations would take O(n^4) time to insert all patterns into the RSIT.
=> This paper gives an O(n^3) algorithm for doing so.

Algorithm for the Real Scaling (3/4)
P*(g, l) => the result of shrinking by g the longest substring starting at position l that can be shrunk by g.
Ex: T = a a a a b b b c c c a a a a, P = b c, l = 4
  P*(3, 4) = b c a (the region b b b c c c a a a, shown in red on the slide, shrinks by 3)
=> P is a prefix of P*(3, 4).
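One plausible reading of P*(g, l) for an integer g is sketched below. It is an interpretation of the example, not the paper's definition: l is taken as a 0-based position, each run starting from l is divided by g, and the scan stops at the first run whose length is not a multiple of g (keeping only its shrinkable part). The name shrink_longest is hypothetical.

```python
# A guess at P*(g, l) for integer g (assumptions: 0-based l, run-by-run shrinking).

def shrink_longest(text, g, l):
    """Shrink by g the longest prefix of text[l:] whose runs allow it."""
    out = []
    i, n = l, len(text)
    while i < n:
        j = i
        while j < n and text[j] == text[i]:
            j += 1                      # j ends just past the current run
        run = j - i
        out.append(text[i] * (run // g))
        if run % g != 0:
            break                       # only part of this run shrinks; stop here
        i = j
    return "".join(out)

T = "aaaabbbcccaaaa"
p_star = shrink_longest(T, 3, 4)

print(p_star)                   # bca
print(p_star.startswith("bc"))  # P = bc is a prefix of P*(3, 4)
```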

Algorithm for the Real Scaling (4/4)

Conclusion of Real Scaled Indexing Problem