1 A Pre-Processing Algorithm for String Pattern Matching
Laurence Boxer
Department of Computer and Information Sciences, Niagara University
and Department of Computer Science and Engineering, SUNY at Buffalo

2 The Problem
Given a “text” T of n characters and a “pattern” P of m characters, 1 < m < n, find every substring P’ of T that is a copy of P.
Applications:
a) “Find” operations of word processors and Web browsers;
b) molecular biologists’ searches for DNA fragments in genomes or for proteins in protein complexes.
Note the amount of input is Θ(m + n) = Θ(n). Examples are known that require examination of every character of T; hence, the worst-case running time of any solution is Ω(n). There exist algorithms that run in Θ(n) time, which is therefore optimal in the worst case. So, what do I have that’s new & interesting?
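For concreteness, here is a minimal brute-force sketch of the problem statement (my own illustration, not the talk's algorithm): it reports every index of T at which a copy of P begins, in Θ(mn) time in the worst case.

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Brute force: for each of the O(n) alignments, compare up to m characters.
vector<size_t> naiveMatch(const string& T, const string& P) {
    vector<size_t> hits;
    if (P.empty() || P.size() > T.size()) return hits;
    for (size_t i = 0; i + P.size() <= T.size(); ++i)   // O(n) alignments
        if (T.compare(i, P.size(), P) == 0)             // O(m) comparison each
            hits.push_back(i);                          // Theta(mn) in the worst case
    return hits;
}

int main() {
    for (size_t i : naiveMatch("abracadabra", "abra"))
        cout << i << '\n';                              // prints 0 and 7
}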

3 Boyer-Moore algorithm
This well-known algorithm has a worst-case running time that is ω(n). In practice, it often runs in Θ(n) time with a low constant of proportionality. There is a large class of examples for which Boyer-Moore runs in o(n) time (best case: Θ(n / m), an example of more input, i.e., a larger m, resulting in a faster solution). This is because the algorithm recognizes “bad characters” that enable skipping blocks of characters of T. Therefore:
1. Use Boyer-Moore methods as a pre-processing step to reduce, in O(n) time, the amount of data in T that needs to be considered.
2. Apply another, linear-time algorithm to the reduced amount of data.
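For reference, a sketch of the classic Boyer-Moore bad-character table (standard textbook material, not code from the talk; the function name is mine):

#include <array>
#include <string>
using namespace std;

// last[c] = rightmost index of character c in P, or -1 if c does not occur in P.
// If the text character aligned with the last pattern position is c, the pattern
// may safely be shifted right by (m - 1) - last[c] positions; for a character
// that never occurs in P, that is a shift of the full pattern length m.
array<int, 256> badCharTable(const string& P) {
    array<int, 256> last;
    last.fill(-1);
    for (int j = 0; j < (int)P.size(); ++j)
        last[(unsigned char)P[j]] = j;                  // keep the rightmost occurrence
    return last;
}

The pre-processing described on slide 5 needs only the weaker membership form of this rule (does c occur in P at all?), which is enough to rule out whole blocks of T.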

4 Analysis
In the worst case, there is no data reduction, so the resulting algorithm takes Θ(n) time with a higher constant of proportionality than if we had omitted pre-processing.
When T and P are “ordinary English” with P using less of the alphabet than T (which is common), the expected running time is Θ(n) with a smaller constant of proportionality than if we do not pre-process as described.
Best case: Θ(n / m) time.

5 Start by finding characters in T that can’t be last characters of matches
In Θ(m) time, scan the characters of P, marking which characters of the alphabet appear in P.
Boyer-Moore “bad character” rule: if the character of T aligned with the last character of P isn’t in P, then none of the m characters of T starting with this one can align with the last character of P in a substring match.
In the slide’s example (a case-insensitive search), positions 2, 5, 8, 9, 12, 13, 14, 15, 18, 19, 20 are examined; we conclude that positions 0-13, 15-18, 20-22 cannot be last positions of matching substrings. Note that among the eliminated positions is the “t” at position 6, which was never individually examined.
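A minimal sketch of this first pass, assuming an 8-bit alphabet and using identifiers of my own choosing (not the talk's code): it returns the blocks of T that the bad-character rule rules out as final positions of matches.

#include <algorithm>
#include <array>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Mark which characters occur in P (Theta(m)), then scan T at the alignment of
// P's last character.  Whenever T[i] does not occur in P, the block
// [i, i + m - 1] is ruled out as a set of final positions and the scan jumps
// ahead by m; otherwise it advances by one position.
vector<pair<size_t, size_t>> ruledOutSegments(const string& T, const string& P) {
    array<bool, 256> inP{};                             // value-initialized to all false
    for (unsigned char c : P) inP[c] = true;

    vector<pair<size_t, size_t>> ruledOut;              // closed intervals [lo, hi]
    size_t m = P.size();
    for (size_t i = m - 1; i < T.size(); ) {            // i = candidate final position
        if (!inP[(unsigned char)T[i]]) {
            ruledOut.push_back({i, min(i + m - 1, T.size() - 1)});
            i += m;                                     // bad character: skip a block
        } else {
            ++i;                                        // no conclusion at this position
        }
    }
    return ruledOut;
}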

6 Next, find positions in T not yet ruled out as final positions of substring matches
This is done in O(n) time by computing the complement of the union of the segments determined in the previous step. In the example, only positions 14 and 19 remain.
Expand the intervals of possible final positions by m-1 positions to the left to obtain intervals containing the possible matches – in the example, [12,14] ∪ [17,19].
Apply a linear-time algorithm to these remaining segments of T.
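A sketch of this second pass under the same assumptions as before (the interval representation and function names are mine): it complements the ruled-out segments over the valid final positions, widens each surviving interval by m-1 to the left, and merges any intervals that then touch; each resulting segment of T can be handed to any linear-time matcher.

#include <algorithm>
#include <utility>
#include <vector>
using namespace std;

// Complement the ruled-out segments over the valid final positions [m-1, n-1],
// widen each surviving interval by m-1 to the left so it contains every
// candidate match, and merge intervals that now touch or overlap.
// Assumes the ruled-out segments are sorted and disjoint, as produced above.
vector<pair<size_t, size_t>>
searchIntervals(const vector<pair<size_t, size_t>>& ruledOut, size_t n, size_t m) {
    vector<pair<size_t, size_t>> keep;                  // closed intervals of T to search
    size_t next = m - 1;                                // first position not yet ruled out

    auto addWidened = [&](size_t lo, size_t hi) {
        lo = (lo >= m - 1) ? lo - (m - 1) : 0;          // widen left by m - 1
        if (!keep.empty() && lo <= keep.back().second + 1)
            keep.back().second = max(keep.back().second, hi);
        else
            keep.push_back({lo, hi});
    };

    for (const auto& seg : ruledOut) {
        if (seg.first > next) addWidened(next, seg.first - 1);
        next = max(next, seg.second + 1);
    }
    if (next + 1 <= n) addWidened(next, n - 1);         // tail interval, if any
    return keep;
}

In the slide's example, the surviving final positions 14 and 19 widen to the segments [12,14] and [17,19].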

7 Experimental results
Thanks to Stephen Englert, who wrote the test program.
Used the “Z algorithm” as the linear-time matcher.
Implementation in C++, under Unix.
Time units are C++ “clock” units.
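The talk does not include its Z-algorithm code; below is a standard textbook version for reference (a sketch, not the authors' implementation): the Z-array of P + separator + T is computed, and every entry of length at least m marks a match.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// z[i] = length of the longest substring of s starting at i that is also a prefix of s.
vector<int> zArray(const string& s) {
    int n = (int)s.size();
    vector<int> z(n, 0);
    if (n == 0) return z;
    z[0] = n;
    int l = 0, r = 0;                                   // rightmost Z-box found so far is [l, r)
    for (int i = 1; i < n; ++i) {
        if (i < r) z[i] = min(r - i, z[i - l]);         // reuse previously computed values
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Report every starting index in T of an occurrence of P, in Theta(m + n) time.
// Assumes the separator character '\x01' occurs in neither P nor T.
vector<int> zMatch(const string& P, const string& T) {
    string s = P + '\x01' + T;
    vector<int> z = zArray(s);
    vector<int> hits;
    int m = (int)P.size();
    for (int i = m + 1; i < (int)s.size(); ++i)
        if (z[i] >= m) hits.push_back(i - m - 1);       // offset within T
    return hits;
}

int main() {
    for (int i : zMatch("abra", "abracadabra"))
        cout << i << '\n';                              // prints 0 and 7
}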

8 Experimental Results – best case experiment – “ordinary English” text
T: file "test2.txt", n = 2,350,367

P          With Preprocessing    Without Preprocessing
"%"^4               8                    167
"%"^8               5                    167
"%"^16              3                    166
"%"^32              2                    168
"%"^64              1                    167

“%” does not occur in T, so all characters of T are “bad.”

9 Artificial best case experiment
Text T = "#"^n, n = 2^k (neither pattern occurs in T).

           pattern = "12345678"             pattern = "1234567890123456"
 k      Preprocessed   Not Preproc.      Preprocessed   Not Preproc.
 19           1             37                 0
 20           2             76                 1              73
 21           4            150                 2             151
 22           8            307                 5             303
 23          18            621                11             622

10 Worst case experiment – preprocessing doesn’t reduce data
T = "#"^n, n = 2^k, P = "#"^m

             m = 4                   m = 8                   m = 16
 k      Preproc.  Not Preproc.  Preproc.  Not Preproc.  Preproc.  Not Preproc.
 19        159        138          158        138          159        138
 20        319        278          318        277          318        276
 21        648        570          644        567          644        567
 22      1,303      1,148        1,299      1,153        1,289      1,147
 23      2,631      2,321        2,625      2,327        2,613      2,318

Here, preprocessing slows the running time (by about 12% – 16%).

11 “Ordinary English” text & pattern, experiment 1
T: file "test2.txt", n = 2,350,367

P                    Preproc.   Not Preproc.
"algorithm"             41          180
"algorithm"^2            4          177
"algorithm"^4            4          178
"algorithm"^8            2          179

Superlinear speedup likely due to matches vs. no matches.

12 “Ordinary English” text & pattern, experiment 2
T: file "test2.txt", n = 2,350,367

P                    Preproc.   Not Preproc.
"parallel"               9          169
"parallel"^2             4          170
"parallel"^4             3          170
"parallel"^8             1          170

The 9 vs. 41 for “algorithm” is likely due to more “bad” characters, since “parallel” uses fewer distinct letters.


Download ppt "A Pre-Processing Algorithm for String Pattern Matching Laurence Boxer Department of Computer and Information Sciences Niagara University and Department."

Similar presentations


Ads by Google