Ravello, 19-21 September 2003. Indexing Structures for Approximate String Matching. Alessandra Gabriele, Filippo Mignosi, Antonio Restivo, Marinella Sciortino.


1 Indexing Structures for Approximate String Matching. Alessandra Gabriele, Filippo Mignosi, Antonio Restivo, Marinella Sciortino. Ravello, 19-21 September 2003.

2 Approximate string matching is the problem of finding patterns in texts in the presence of “mismatches” or “errors”. It has several applications in data analysis and data retrieval, such as: searching text in the presence of typing or spelling errors; retrieving musical passages; finding biological sequences in the presence of possible mutations or misreads. The nature of the mismatches depends on the problem or application considered and can be captured formally by introducing distances among strings.

3 Let $d: \Sigma^* \times \Sigma^* \to \mathbb{R}^+$ be a distance function. The distance $d(x,y)$ between two strings $x$ and $y$ is the minimal cost of a sequence of operations that transform $x$ into $y$ (and $\infty$ if no such sequence exists). The different possible operations are: 1) Insertion, 2) Deletion, 3) Substitution, 4) Transposition. We consider one of the most commonly used distance functions, the Hamming distance, which allows only substitutions, each costing 1 in the simplified definition. It is finite whenever $|x| = |y|$, and in that case $0 \le d(x,y) \le |x|$. Ex.: x = acgtatct, y = aggttact, $d(x,y) = 3$ (in the simplified definition).
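
The simplified Hamming distance described above can be sketched in a few lines of Python. The function name `hamming_distance` is an illustrative choice, not something taken from the slides; strings of different length get an infinite distance, in line with the remark that the distance is finite whenever the lengths agree.

```python
import math


def hamming_distance(x: str, y: str) -> float:
    """Simplified Hamming distance: each substitution costs 1.

    Returns math.inf when the lengths differ, since no sequence of
    substitutions alone can transform x into y in that case.
    """
    if len(x) != len(y):
        return math.inf
    # Count the positions where the two strings disagree.
    return sum(a != b for a, b in zip(x, y))


# The example from the slide: mismatches at positions 2, 5 and 6 (1-based).
print(hamming_distance("acgtatct", "aggttact"))  # 3
```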

4 Typical approaches in this field consist in considering a percentage D of errors or fixing the number k of them. The new idea in our approach is to introduce a new parameter r and to allow at most k errors for any substring of length r, where r is not necessarily constant. Let S be a string over the alphabet $\Sigma$ and let k, r be non-negative integers such that $k \le r$. A string v occurs in S at position l, up to k errors in a window of size r, if: 1) $|v| < r \Rightarrow d(v, S(l, l+|v|-1)) \le k$; 2) $|v| \ge r \Rightarrow \forall i,\ 1 \le i \le |v|-r+1,\ d(v(i, i+r-1), S(l+i, l+i+r-1)) \le k$. L(S,k,r) is the set of words that satisfy the previous definition for some l, $1 \le l \le |S|-|v|+1$.
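
As a rough illustration of this definition, the following Python sketch checks whether a word occurs at a given position up to k errors in every window of size r, and scans the text naively for all such positions. The names `occurs_at` and `occurrences` are hypothetical, positions are 0-based, and the sketch assumes that the i-th letter of v is aligned with the letter of S at offset l+i (0-based). It is a direct quadratic check, not the indexing method of the talk.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def occurs_at(v, S, l, k, r):
    """True if v occurs in S at 0-based position l, up to k errors per window of size r."""
    if l < 0 or l + len(v) > len(S):
        return False
    segment = S[l:l + len(v)]
    if len(v) < r:
        # Case 1: the word is shorter than the window, so bound the total errors.
        return hamming(v, segment) <= k
    # Case 2: every window of r consecutive letters may contain at most k errors.
    return all(
        hamming(v[i:i + r], segment[i:i + r]) <= k
        for i in range(len(v) - r + 1)
    )


def occurrences(v, S, k, r):
    """All 0-based positions of S where v occurs up to k errors per window of size r."""
    return [l for l in range(len(S) - len(v) + 1) if occurs_at(v, S, l, k, r)]
```

With S = acgtatct and v = aggttact the three mismatches are spread out, so `occurs_at(v, S, 0, 2, 4)` holds while `occurs_at(v, S, 0, 2, 8)` does not: the window size changes the answer even though k is the same.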

5 An index over a fixed text S is an abstract data type based on the set of all factors of S, denoted by Fact(S). This data type is equipped with operations that allow it to answer the following query: given $x \in \mathrm{Fact}(S)$, find the list of all its occurrences in S. This operation can easily be extended to the case of approximate string matching.
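
To make the query concrete, here is a deliberately naive sketch of an exact index: a dictionary from every factor of S to the list of its starting positions. The names `build_factor_index` and `lookup` are hypothetical, positions are 0-based, and the structure takes quadratic space, so it only illustrates the interface of an index rather than any of the compact structures discussed in the talk.

```python
from collections import defaultdict


def build_factor_index(S):
    """Map every factor (substring) of S to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(S)):
        for j in range(i + 1, len(S) + 1):
            index[S[i:j]].append(i)
    return index


def lookup(index, x):
    """Occurrences of x in the indexed text; empty list if x is not a factor."""
    return index.get(x, [])


index = build_factor_index("abaab")
print(lookup(index, "ab"))  # [0, 3]
```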

6 Statement of the problem: given a “text” S, a “pattern” x and two integers k and r, return all the text positions l such that x occurs in S at position l, up to k errors for r symbols. Natural solution: build an automaton recognizing the language L(S,k,r), then apply determinization and minimization. The resulting automaton can have exponential size!!

7 Let u be a string over the alphabet $\Sigma$. The neighbourhood of u is the set of all words that have at most k errors in every window of size r with respect to u, i.e.: $V(u,k,r) = L(u,k,r) \cap \Sigma^{|u|}$. Bounds different from the classical exponential ones have been obtained by using a new parameter R, called the Repetition Index. The Repetition Index R(S,k,r) of S is the smallest integer h such that all strings of this length occur in at most one position of the text, up to k errors for r symbols: $R(S,k,r) = \min\{ h \ge 1 \text{ s.t. } \forall i, j,\ 1 \le i, j \le |S|-h+1,\ V(S(i,i+h-1),k,r) \cap V(S(j,j+h-1),k,r) \ne \emptyset \Rightarrow i = j \}$. R(S,k,r) is always defined because h = |S| is an element of the set described above. If $k/r \ge 1/2$ then R(S,k,r) = |S|.
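
The definition of the Repetition Index can be evaluated directly on very small inputs. The sketch below is a brute-force reading of the definition, not the algorithm mentioned later in the talk: it enumerates each neighbourhood explicitly, so its cost is exponential in h. The function names are hypothetical, and the alphabet defaults to the letters that appear in S, which is an assumption of this sketch.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def within(z, u, k, r):
    """True if z (same length as u) has at most k errors in every window of size r."""
    if len(u) < r:
        return hamming(z, u) <= k
    return all(hamming(z[i:i + r], u[i:i + r]) <= k for i in range(len(u) - r + 1))


def neighbourhood(u, k, r, alphabet):
    """V(u,k,r): all words of length |u| within k errors per window of size r from u."""
    return {"".join(z) for z in product(alphabet, repeat=len(u)) if within(z, u, k, r)}


def repetition_index(S, k, r, alphabet=None):
    """Smallest h such that the neighbourhoods of all length-h factors are pairwise disjoint."""
    alphabet = sorted(set(S)) if alphabet is None else alphabet
    n = len(S)
    for h in range(1, n + 1):
        hoods = [neighbourhood(S[i:i + h], k, r, alphabet) for i in range(n - h + 1)]
        if all(
            hoods[i].isdisjoint(hoods[j])
            for i in range(len(hoods))
            for j in range(i + 1, len(hoods))
        ):
            return h
    return n  # only reached for the empty string


print(repetition_index("abaab", 0, 1))  # 3: exact case, every factor of length 3 is unique
print(repetition_index("abaab", 1, 2))  # 5: k/r = 1/2, so the index equals |S|
```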

8 Let S be an infinite sequence generated by a memoryless source and let $S_n$ denote the prefix of S of length n. Two asymptotic results are stated: one for fixed k and r, holding almost surely (a.s.), and one for fixed k and $r(n) \to \infty$ (in particular for $r(n) = R(S_n,k,r(n))$) [formulas not preserved in the transcript]. Here $H(D,p) = (1-D)\log\frac{1-D}{p} + D\log\frac{D}{1-p}$, where p is the probability that the letters in two distinct positions are equal and $0 \le D \le 1-p$.

9 Some Average Estimations for Random Texts
Alphabet $\Sigma$, $|\Sigma| = 4$, $r = R(S,k,r)$, $k = 2$ fixed

|S|               R(S,k,r)
64                ~13
80                ~14
128               ~15
256               ~16
1024              ~19
16384             ~25
~300,000          ~30
~5,000,000        ~35
~3,000,000,000    ~47

10 Using the Repetition Index we give a method to construct the automaton that recognizes the language L(S,k,r). Its size is a function of |S|, R(S,k,r) and the number of errors t in a window of size R(S,k,r). More precisely, the size is $O(|S| \cdot R^t)$. Worst case: $R, t = O(|S|) \Rightarrow$ exponential. The size of the automaton is exponential again!! Average case: $R = O(\log |S|)$. If t is constant $\Rightarrow$ linear times a polylog: for k fixed, the size of the automaton is linear times a polylog of the size of the text S!!

11 Indexing. $|x| \ge R(S,k,r)$: case of long patterns. $|x| < R(S,k,r)$: case of short patterns.

12 Case $|x| \ge R(S,k,r)$. In this case, if x appears, it appears just once. Build the deterministic automaton A recognizing the language L(S,k,r). Label each state with an integer representing the length of the shortest path from that state to a state without outgoing edges. “Read” the string x for as long as possible; if the end of x is reached, the output is |S| minus the number associated with the arrival state minus the length of the pattern x.
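
The labelling trick can be tried out on the exact case (k = 0). The sketch below swaps the automaton of the talk for a plain suffix trie, which is an assumption made only to keep the example short; the class and function names are hypothetical. Each state stores the length of the shortest path to a state without outgoing edges, and the reported position is 0-based. The answer is meaningful only when the pattern occurs exactly once in the text.

```python
class State:
    """A trie state: outgoing edges plus the shortest distance to a sink state."""

    __slots__ = ("edges", "dist")

    def __init__(self):
        self.edges = {}
        self.dist = 0


def _label(state):
    """Post-order pass: dist = shortest path length to a state without outgoing edges."""
    if state.edges:
        state.dist = 1 + min(_label(child) for child in state.edges.values())
    else:
        state.dist = 0
    return state.dist


def build_suffix_trie(S):
    """Insert every suffix of S into a trie, then label all states."""
    root = State()
    for i in range(len(S)):
        node = root
        for c in S[i:]:
            node = node.edges.setdefault(c, State())
    _label(root)
    return root


def locate(root, text_length, x):
    """Read x from the root; return |S| - label - |x|, or None if x cannot be read."""
    state = root
    for c in x:
        state = state.edges.get(c)
        if state is None:
            return None
    return text_length - state.dist - len(x)


S = "abaab"
root = build_suffix_trie(S)
print(locate(root, len(S), "baa"))  # 1
```

Since the recursion in `_label` is as deep as the text is long, this version is only suitable for short strings.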

13 Case $|x| < R = R(S,k,r)$. This procedure concerns the case of short patterns and includes: a non-trivial reduction to the Document Listing Problem; an algorithm for finding the Repetition Index; standard filters for approximate string matching.

14 The average searching time of a pattern in our data structures turns out to be linear under a hypothesis on the distribution of R(S,k,r). More precisely, we require that there exists a real number $\beta > 1$ such that, if $\mu$ is the expected value of R(S,k,r) for a text S of length n, then the probability that $R(S,k,r) > \beta\mu$ goes to zero faster than 1/n. Under this condition, the average running time spent by our algorithm for finding the list occ(x) of all occurrences of a pattern x in a text, up to k errors in every window of size r, is proportional to $|x| + |occ(x)|$.

15 [Figure: Distribution of R(S,k,r). Axes: Number of strings, Repetition Index.]

16 An Application: the Longest Common Substring Problem with k Mismatches. Solution: we build two automata recognizing the languages $L(S_1,k_1,r)$ and $L(S_2,k_2,r)$, with $k_1 + k_2 = k$. With a DFS we find the longest label of the paths common to the two automata, starting from the two initial states. The average time spent by this algorithm is $O(\max\{ |S_1| \log(|S_1|)^{k_1}, |S_2| \log(|S_2|)^{k_2} \})$.
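
For comparison, the classical version of this problem (at most k mismatches in total, with no window parameter) has a simple quadratic baseline that slides a window along each diagonal of the alignment matrix. The sketch below is that baseline, not the automata-based method of the slide, and the function name is hypothetical; it can serve as a reference when testing faster approaches on small inputs.

```python
def longest_common_substring_k_mismatches(s1, s2, k):
    """Length of the longest pair of equal-length substrings of s1 and s2
    that differ in at most k positions (total Hamming distance <= k)."""
    n, m = len(s1), len(s2)
    best = 0
    # Each diagonal d = i - j fixes the relative shift between the two strings.
    for d in range(-(m - 1), n):
        i = max(d, 0)
        j = i - d
        length = min(n - i, m - j)
        left = 0
        mismatches = 0
        for right in range(length):
            if s1[i + right] != s2[j + right]:
                mismatches += 1
            # Shrink the window from the left until it holds at most k mismatches.
            while mismatches > k:
                if s1[i + left] != s2[j + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best


print(longest_common_substring_k_mismatches("abcde", "xbcdy", 0))  # 3
print(longest_common_substring_k_mismatches("abcde", "xbcdy", 1))  # 4
```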

17 Work in progress... To generalize the results proved for the Hamming distance: – to the Edit (or Weighted Levenshtein) distance, which allows Insertions, Deletions and Substitutions; – to Score functions, which are linked to the Levenshtein distance and are much more widely used in Computational Biology. To prove the hypothesis on the distribution of R(S,k,r) (in accordance with the experimental results obtained by A. Langiu). To find other applications of our data structures.

