Ravello, 19-21 September 2003. Indexing Structures for Approximate String Matching. Alessandra Gabriele, Filippo Mignosi, Antonio Restivo, Marinella Sciortino.



Approximate string matching is the problem of finding patterns in texts in the presence of “mismatches” or “errors”. It has several applications in data analysis and data retrieval, such as:
– searching text in the presence of typing or spelling errors;
– retrieving musical passages;
– finding biological sequences in the presence of possible mutations or misreads.
The nature of the mismatches depends on the problem or application considered and can be captured formally by introducing distances among strings.

Let d: Σ* × Σ* → R+ be a distance function. The distance d(x,y) between two strings x and y is the minimal cost of a sequence of operations that transform x into y (and ∞ if no such sequence exists). The different possible operations are:
1) Insertion, 2) Deletion, 3) Substitution, 4) Transposition.
We consider one of the most commonly used distance functions, the Hamming distance, which allows only substitutions, each of cost 1 in the simplified definition. It is finite whenever |x| = |y|, and in that case 0 ≤ d(x,y) ≤ |x|.
Ex.: x = acgtatct, y = aggttact, d(x,y) = 3 (in the simplified definition).
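The Hamming distance in the simplified definition can be sketched in a few lines of Python. The function name `hamming_distance` is hypothetical, and returning infinity for strings of different lengths is an assumption that follows the convention stated above (no sequence of substitutions exists).

```python
import math


def hamming_distance(x, y):
    """Number of positions where x and y differ (substitutions of cost 1).

    Assumption: strings of different lengths get distance infinity,
    since substitutions alone cannot transform one into the other.
    """
    if len(x) != len(y):
        return math.inf
    return sum(a != b for a, b in zip(x, y))


# The example from the slide: mismatches at positions 2, 5 and 6.
print(hamming_distance("acgtatct", "aggttact"))
```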

Typical approaches in this field consist in considering a percentage D of errors or in fixing the number k of them. The new idea in our approach is to introduce a new parameter r and to allow at most k errors in any substring of length r, where r is not necessarily constant.
Let S be a string over the alphabet Σ and let k, r be non-negative integers such that k ≤ r. A string v occurs in S at position l, up to k errors in a window of size r, if:
1) |v| < r ⇒ d(v, S(l, l+|v|-1)) ≤ k;
2) |v| ≥ r ⇒ ∀ i, 1 ≤ i ≤ |v|-r+1, d(v(i, i+r-1), S(l+i-1, l+i+r-2)) ≤ k.
L(S,k,r) is the set of words that satisfy the previous definition for some l, 1 ≤ l ≤ |S|-|v|+1.
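A brute-force check of this definition may make the role of the window clearer. The sketch below is only a direct transcription of the two cases, not the indexing method of the talk. The names `occurs_at` and `occurrences` are hypothetical, positions are 1-based as in the definition, and each window of v is compared with the aligned window of S.

```python
def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def occurs_at(v, S, l, k, r):
    """True if v occurs in S at 1-based position l, up to k errors per window of size r."""
    start = l - 1
    if start < 0 or start + len(v) > len(S):
        return False
    segment = S[start:start + len(v)]
    if len(v) < r:
        # Case 1: v is shorter than the window, so bound the total number of errors.
        return hamming(v, segment) <= k
    # Case 2: every window of size r may contain at most k errors.
    return all(
        hamming(v[i:i + r], segment[i:i + r]) <= k
        for i in range(len(v) - r + 1)
    )


def occurrences(v, S, k, r):
    """All 1-based positions l at which v occurs in S up to k errors per window of size r."""
    return [l for l in range(1, len(S) - len(v) + 2) if occurs_at(v, S, l, k, r)]
```

For instance, v = aggttact has 3 mismatches in total against S = acgtatct, yet never more than 2 in any window of size 4, so `occurs_at("aggttact", "acgtatct", 1, 2, 4)` holds while the same call with k = 1 does not.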

An index over a fixed text S is an abstract data type based on the set of all factors of S, denoted by Fact(S). This data type is equipped with operations that allow it to answer the following query: given x ∈ Fact(S), find the list of all its occurrences in S. This operation can easily be extended to the case of approximate string matching.
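To fix ideas, the query answered by an index can be illustrated with a deliberately naive structure: a dictionary from every factor of the text to its list of starting positions. This is only a sketch of the abstract data type for exact matching. It uses quadratically many entries and is not the structure proposed in the talk. The names `build_factor_index` and `occurrences` are hypothetical.

```python
from collections import defaultdict


def build_factor_index(S):
    """Map every non-empty factor of S to the sorted list of its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(S)):
        for end in range(start + 1, len(S) + 1):
            index[S[start:end]].append(start + 1)
    return index


def occurrences(index, x):
    """List of 1-based positions of x in the indexed text (empty if x is not a factor)."""
    return index.get(x, [])


idx = build_factor_index("abab")
print(occurrences(idx, "ab"))
```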

Statement of the problem: given a “text” S, a “pattern” x and two integers k and r, return all the text positions l such that x occurs in S at position l, up to k errors for r symbols.
Natural solution: build an automaton recognizing the language L(S,k,r), then apply determinization and minimization.
Drawback: exponential size!!

Bounds different from the classical exponential ones have been obtained by using a new parameter R, called the Repetition Index.
Let u be a string over the alphabet Σ. The neighbourhood of u is the set of all words that have at most k errors in every window of size r with respect to u, i.e. V(u,k,r) = L(u,k,r) ∩ Σ^|u|.
The Repetition Index R(S,k,r) of S is the smallest integer h such that every string of length h occurs in at most one position of the text, up to k errors for r symbols:
R(S,k,r) = min{ h ≥ 1 s.t. ∀ i, j, 1 ≤ i, j ≤ |S| - h + 1, V(S(i,i+h-1),k,r) ∩ V(S(j,j+h-1),k,r) ≠ ∅ ⇒ i = j }.
R(S,k,r) is always defined, because h = |S| belongs to the set described above. If k/r ≥ 1/2 then R(S,k,r) = |S|.
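For very small texts the Repetition Index can be computed directly from its definition. The sketch below enumerates every word of Σ^h to build each neighbourhood, so it takes exponential time and is meant only to illustrate the definition, not to reproduce the algorithm mentioned later in the talk. The function names and the explicit `alphabet` argument are assumptions.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def in_neighbourhood(w, u, k, r):
    """True if w (with |w| = |u|) has at most k errors against u in every window of size r."""
    if len(w) != len(u):
        return False
    if len(u) < r:
        return hamming(w, u) <= k
    return all(hamming(w[i:i + r], u[i:i + r]) <= k for i in range(len(u) - r + 1))


def neighbourhood(u, k, r, alphabet):
    """V(u,k,r), built by brute-force enumeration of all words of length |u|."""
    words = ("".join(t) for t in product(alphabet, repeat=len(u)))
    return {w for w in words if in_neighbourhood(w, u, k, r)}


def repetition_index(S, k, r, alphabet):
    """Smallest h such that factors of length h at distinct positions have disjoint neighbourhoods."""
    n = len(S)
    for h in range(1, n + 1):
        hoods = [neighbourhood(S[i:i + h], k, r, alphabet) for i in range(n - h + 1)]
        if all(
            hoods[i].isdisjoint(hoods[j])
            for i in range(len(hoods))
            for j in range(i + 1, len(hoods))
        ):
            return h
    return n
```

With k = 0 this reduces to the smallest length at which all factors of S are distinct, and with k/r ≥ 1/2 the search only stops at h = |S|, in agreement with the remark above.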

Let S be an infinite sequence generated by a memoryless source and let S_n be the sequence of prefixes of S of length n.
For fixed k and r, almost surely: [formula not preserved in the transcript].
For fixed k and r(n) → ∞ (in particular for r(n) = R(S_n,k,r(n))): [formula not preserved in the transcript].
Here H(D,p) = (1-D) log((1-D)/p) + D log(D/(1-p)), where p is the probability that the letters in two distinct positions are equal and 0 ≤ D ≤ 1-p.
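The function H(D,p) is easy to evaluate numerically. The sketch below assumes natural logarithms (the slide does not state the base) and uses the usual convention 0 · log 0 = 0 at D = 0. The function name `H` simply mirrors the slide.

```python
import math


def H(D, p):
    """H(D,p) = (1-D) log((1-D)/p) + D log(D/(1-p)), for 0 < p < 1 and 0 <= D <= 1-p.

    Assumptions: natural logarithm, and 0 * log(0) is taken to be 0.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 <= D <= 1 - p:
        raise ValueError("D must satisfy 0 <= D <= 1 - p")
    first = (1 - D) * math.log((1 - D) / p)
    second = 0.0 if D == 0 else D * math.log(D / (1 - p))
    return first + second


# Uniform source over 4 letters: two positions carry equal letters with p = 1/4.
print(H(0.0, 0.25))
```

At D = 0 the value is log(1/p), and at the other end of the range, D = 1-p, both terms vanish and H is 0.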

Some Average Estimations for Random Texts
Alphabet Σ, |Σ| = 4, r = R(S,k,r), k = 2 fixed

|S|            R(S,k,R)
64             ~13
80             ~14
128            ~15
256            ~16
1024           ~19
16384          ~25
~ [missing]    ~30
~ [missing]    ~35
~ [missing]    ~47

Using the Repetition Index we give a method to construct the automaton that recognizes the language L(S,k,r). Its size is a function of |S|, R(S,k,r) and the number of errors t in a window of size R(S,k,r). More precisely, the size is O(|S| · R^t).
Worst case: R, t = O(|S|) ⇒ exponential. The size of the automaton is exponential again!!
Average case: R = O(log |S|). If t is constant ⇒ linear times a polylog, i.e. O(|S| · log^t |S|). For k fixed, the size of the automaton is linear times a polylog of the size of the text S!!

Indexing
|x| ≥ R(S,k,r): case of long patterns.
|x| < R(S,k,r): case of short patterns.

Case of long patterns: |x| ≥ R(S,k,r). In this case, if x appears, it appears just once.
Build the deterministic automaton A recognizing the language L(S,k,r).
Label every state with an integer representing the length of the shortest path from that state to a state without outgoing edges.
“Read” the string x for as long as possible. If the end of x is reached, the output is |S| minus the number associated with the arrival state minus the length of the pattern x.
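The labelling trick can be tried out in the exact case k = 0, where a plain suffix trie can stand in for the automaton A. This substitution is an assumption made for illustration only: the construction of the talk for k > 0 is not reproduced here. The names `State`, `build`, `label` and `locate` are hypothetical, and the position returned is 0-based, as the formula |S| - label - |x| yields.

```python
class State:
    """A trie state: outgoing edges plus the distance to the nearest state with no edges."""

    def __init__(self):
        self.next = {}
        self.dist = 0


def label(state):
    """Store in each state the length of the shortest path to a state without outgoing edges."""
    if state.next:
        state.dist = 1 + min(label(child) for child in state.next.values())
    return state.dist


def build(S):
    """Suffix trie of S (exact case k = 0), with every state labelled."""
    root = State()
    for i in range(len(S)):
        node = root
        for c in S[i:]:
            if c not in node.next:
                node.next[c] = State()
            node = node.next[c]
    label(root)
    return root


def locate(root, text_length, x):
    """Read x from the root; return |S| - label - |x|, or None if x cannot be read."""
    node = root
    for c in x:
        node = node.next.get(c)
        if node is None:
            return None
    return text_length - node.dist - len(x)


S = "abab"
root = build(S)
print(locate(root, len(S), "aba"), locate(root, len(S), "bab"))
```

For S = abab with k = 0, all factors of length 3 are distinct, so patterns of length at least 3 are located unambiguously. For shorter patterns such as "ab" the formula returns only one of the occurrences, which is why short patterns need the separate treatment.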

Case of short patterns: |x| < R = R(S,k,r).
The procedure for short patterns includes:
– a non-trivial reduction to the Document Listing Problem;
– an algorithm for finding the Repetition Index;
– standard filters for approximate string matching.

The average searching time of a pattern in our data structures turns out to be linear under a hypothesis on the distribution of R(S,k,r). More precisely, we require that there exists a real number β > 1 such that, if μ is the expected value of R(S,k,r) for a text S of length n, then the probability that R(S,k,r) > βμ goes to zero faster than 1/n. Under this condition, the average running time spent by our algorithm for finding the list occ(x) of all occurrences of a pattern x in a text, up to k errors in every window of size r, is proportional to |x| + |occ(x)|.

[Figure: distribution of R(S,k,r). Axes: Repetition Index, number of strings.]

An Application: the Longest Common Substring Problem with k Mismatches
Solution: we build two automata recognizing the languages L(S_1,k_1,r) and L(S_2,k_2,r), with k_1 + k_2 = k. With a DFS starting from the two initial states, we find the longest label of the paths common to the two automata.
The average time spent by this algorithm is O(max{ |S_1| log(|S_1|)^{k_1}, |S_2| log(|S_2|)^{k_2} }).
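As a point of comparison, the problem itself can be stated through a simple quadratic baseline. It slides a window with at most k mismatches along every alignment of the two strings. This is not the automaton-based method described above, and it assumes the plain variant of the problem, with k bounding the total number of mismatches and no window parameter r. The function name is hypothetical.

```python
def longest_common_substring_k_mismatches(s1, s2, k):
    """Return (length, start in s1, start in s2) of a longest pair of equal-length
    substrings whose Hamming distance is at most k. Starts are 0-based.
    Baseline running in O(|s1| * |s2|) time."""
    best = (0, 0, 0)
    n, m = len(s1), len(s2)
    for shift in range(-(m - 1), n):
        # On this diagonal, s1[i + t] is aligned with s2[j + t].
        i = max(0, shift)
        j = i - shift
        length = min(n - i, m - j)
        left = 0
        mismatches = 0
        for right in range(length):
            if s1[i + right] != s2[j + right]:
                mismatches += 1
            # Shrink the window from the left until it has at most k mismatches.
            while mismatches > k:
                if s1[i + left] != s2[j + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best[0]:
                best = (right - left + 1, i + left, j + left)
    return best


print(longest_common_substring_k_mismatches("acgtatct", "aggttact", 3))
```

On the two example strings used earlier in the talk, allowing k = 3 mismatches matches the strings in full, while k = 0 gives the ordinary longest common substring.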

Works in progress...
To generalize the results proved for the Hamming distance
– to the Edit (or Weighted Levenshtein) distance, which allows Insertions, Deletions and Substitutions;
– to the Score functions, which are linked to the Levenshtein distance and are much more widely used in Computational Biology;
To prove the hypothesis on the distribution of R(S,k,r) (in accordance with the experimental results obtained by A. Langiu);
To find other applications of our data structures.