1 Algorithms and Data Structures lecture 9 Intro to string matching
Szymon Grabowski Łódź, 2016

2 The field of string matching (aka stringology)
Myriads of problems of the type: find a string (more generally: a pattern) in a given text. Applications: text editors, pdf / ebook readers, file systems, computational biology (the text is DNA, proteins, or RNA...), web search engines, compression algorithms, template matching (in images), music retrieval (e.g., query-by-humming), AV scanners, ...

3 String matching: basic notation and terminology
Text T, pattern P, alphabet Σ. Characters of T and P are taken from the alphabet Σ. n = |T|, m = |P|, σ = |Σ| (alphabet size). $ – text terminator; an abstract symbol, lexicographically lowest. A string x is a prefix of string z iff z = xy for some string y (y may be empty, denoted by ε). If z = xy and y ≠ ε, then x is a proper prefix of z. Similarly, x is a suffix of z iff z = yx for some string y. We also say that x is a factor of z iff there exist strings a, b such that z = axb.

4 Exact string matching problem
Problem: Find all occurrences of P[1..m] in text T[1..n]. We say that P occurs in T with shift s if P[1..m] = T[s+1..s+m] (what is the largest possible s?). T is presented on-line, so there is no time to preprocess it. On the other hand, P is usually much shorter than T, so we can (and should!) preprocess P. A fundamental problem, studied for 40+ years, and new algorithms still appear...

5 More advanced string (pattern) searching tasks
approximate search (several errors between P and a substring of T allowed); multiple search (looking for several patterns “at once”); extended search (classes of characters, regular expressions, etc.); global (as opposed to local) measures of similarity between strings; 2D search (in images) – very hard if combined with rotation and/or scaling and/or lightness invariance.

6 On-line vs. off-line string search
On-line (Boyer-Moore, KMP, etc.) – the whole T must be scanned (even if skipping some symbols is possible); off-line – preprocessing space and time are involved, but the search time can be truly sublinear in |T|. Off-line searching = indexed searching. Two types of text indexes: word-based indexes (e.g., Glimpse (Manber & Wu, 1994)); full-text indexes (e.g., suffix trees, suffix arrays).

7 Exact string matching problem, example

8 Exact string matching problem, cont’d
The naïve (brute-force) algorithm tries to match P against each position of T. That is, for each i, i=1..n–m+1, P[1..m] is matched against T[i..i+m–1]. More precisely, compare P[1..m] to T[startpos..startpos+m–1] from left to right; if a mismatch occurs, slide P by 1 char and start anew. Worst case complexity is O(mn). In practice, however, it is close to O(n). Usually not too bad, but there are much faster algorithms.
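A minimal sketch of the brute-force search described above (Python; the function and variable names are ours, not the lecture's):

```python
def naive_search(T, P):
    """Return all 0-based shifts s such that T[s:s+m] == P (brute force)."""
    n, m = len(T), len(P)
    occurrences = []
    for s in range(n - m + 1):                 # try every alignment of P against T
        j = 0
        while j < m and T[s + j] == P[j]:      # compare left to right
            j += 1
        if j == m:                             # all m characters matched
            occurrences.append(s)
    return occurrences

# Worst case from the next slide: naive_search("a" * 20 + "b", "aaaaab")
```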

9 The naïve algorithm in action
Worst case example for the naïve algorithm Let T = aaaaaaaaa...b (e.g., one b preceded by 999,999 a’s). Let P = aaaaab. At each position in T, a mismatch is found only after as many as |P| = 6 char comparisons.

10 The idea of the Knuth-Morris-Pratt (KMP) alg (1977)
Let’s start with a simple observation. If a mismatch occurs at position j with the naïve algorithm, we know that the j–1 previous chars do match. KMP exploits this fact: after a mismatch at P’s position j, it shifts P by (j–1) minus the length of the longest proper prefix of P[1..j–1] that is also a suffix of P[1..j–1]. Complicated? Not really. See an example: P = abababc, T = ababad... Mismatch at position 6 (b ≠ d). KMP, in O(1) time (thanks to its preprocessing), finds that P can safely be shifted by 5 – 3. Why 5, why 3? 5 = |ababa|, 3 = |aba|.
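A compact sketch of KMP along the lines above (Python; 0-based indexing, our own naming rather than the lecturer's code):

```python
def kmp_prefix_table(P):
    """pi[j] = length of the longest proper prefix of P[:j+1] that is also its suffix."""
    m = len(P)
    pi = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and P[k] != P[j]:
            k = pi[k - 1]                  # fall back to the next shorter border
        if P[k] == P[j]:
            k += 1
        pi[j] = k
    return pi

def kmp_search(T, P):
    """Return all 0-based positions where P occurs in T, in O(n + m) total time."""
    pi, occurrences, k = kmp_prefix_table(P), [], 0
    for i, c in enumerate(T):
        while k > 0 and P[k] != c:
            k = pi[k - 1]                  # shift P using the precomputed borders
        if P[k] == c:
            k += 1
        if k == len(P):                    # full match ending at position i
            occurrences.append(i - len(P) + 1)
            k = pi[k - 1]
    return occurrences
```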

11 KMP properties Linear time in the worst case (but also in the best case). O(m) extra space (for a table built in the preprocessing phase). Not very practical: about 2x slower than the naïve one (acc. to: R. Baeza-Yates & G. Navarro, Text Searching: Theory and Practice). Still, to some degree it depends on the alphabet size: with a small alphabet (e.g., DNA), KMP runs relatively fast.

12 The Boyer-Moore algorithm (1977). First search algorithm with skips
KMP is optimal in the worst case but never skips characters in T. That is, Θ(n) time also in the best case. Skipping chars in T?! You gotta be kidding. How can it be possible..? Idea: compare P against T from right to left. If, e.g., the char of T aligned with the rightmost char of P does not appear anywhere in P, we can shift P by its whole length! But how can we quickly check that some symbol does not appear in P?

13 The Boyer-Moore idea The key observation is that, usually, m << n and σ << n. Consequently, any preprocessing in O(m + σ) time is practically free. The BM preprocessing involves a σ-sized table telling the rightmost position of each alphabet symbol in P (or zero, if a given symbol does not occur in P). Thanks to this table, the question “how far can we shift P after a char mismatch?” can be answered in O(1) time.

14 Why we don’t like the original BM that much...
Boyer & Moore tried to be too smart. They used not one but two heuristics intended to maximize the pattern shift (skip). In the Cormen et al. terminology, these are the bad-character heuristic and the good-suffix heuristic. The skip is the max of the skips suggested by the two heuristics. In practice, however, it does not pay to complicate things: the bad-character heuristic alone is good enough. Using both heuristics makes the skip longer on avg, but the extra calculations have their own cost...

15 Boyer-Moore-Horspool (1980) Very simple and practical BM variant
From: R. Baeza-Yates & G. Navarro, Text Searching: Theory and Practice, to appear in Formal Languages and Applications, Physica-Verlag, Heidelberg.
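The figure cited above walks through the Horspool shift table; here is a minimal sketch of the algorithm itself (Python, our own naming, bad-character shifts only):

```python
def horspool_search(T, P):
    """Boyer-Moore-Horspool: compare right to left, shift by the bad-character rule."""
    n, m = len(T), len(P)
    # d[c] = shift when c is the text char aligned with P's last position;
    # default (char absent from P[0..m-2]) is the whole pattern length m
    d = {P[j]: m - 1 - j for j in range(m - 1)}
    occurrences, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and T[s + j] == P[j]:     # right-to-left comparison
            j -= 1
        if j < 0:
            occurrences.append(s)
        s += d.get(T[s + m - 1], m)            # skip decided by a single text char
    return occurrences
```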

16 BMH example. Miss, miss...
For technical reasons (relatively large alphabet), we do not present the whole table d. Text T: from T.S. Eliot’s The Hippopotamus.

17 BMH example. Miss... Hit! What then? We read that d[‘s’] is 12, hence we shift P by its whole length (there is no other ‘s’ in P, right?). And the search continues...

18 Worst and average case time complexities
Assumption for the avg case analysis: uniformly random char distribution, characters in T independent of each other. The same assumptions for P. Naturally, we assume m = O(n) and σ = O(n), so instead of, e.g., O(n+m) we are going to write just O(n). Naïve alg: O(n) avg case, O(mn) worst case. KMP: O(n) avg and worst case. BM: O(n / min(m, σ)) avg case and O(mn) worst case (alas). BMH: same complexities as BM.

19 Worst and average case time complexities, cont’d
The lower bounds on the avg and worst case time complexities are Ω(n log_σ(m) / m) and Ω(n), respectively. Note that n log_σ(m) / m is close to n / m in practice (they are equal in complexity terms as long as m = σ^O(1)). The Backward DAWG Matching (BDM) algorithm (Crochemore et al., 1994) reaches the average case complexity lower bound. Some of its variants, e.g., TurboBDM and TurboRF (Crochemore et al., 1994), reach O(n) worst case without losing on avg.

20 Multiple string matching: problem statement and motivation
Sometimes we have a set of patterns P1, ..., Pr and the task is to find all the occurrences of any Pi (i=1..r) in T. Trivial approach: run an exact string matching alg. r times. Way too slow, even if r is moderate. (Selected) applications: batched query handling in a text collection, looking for a few spelling variants of a word / phrase (e.g., P1 = “color” and P2 = “colour”), anti-virus software (search for virus signatures).

21 Adapting the BM approach to multiple string matching
BMH used a skip table d to perform the longest safe pattern shift guided by a single char only. Having r patterns, we can still perform skips, but they will typically be shorter. Example: P1 = bbcac, P2 = abbcc, T = abadbca... The 5th char of T is b, so we shift all the patterns by 2 chars (2 = min(2, 3)). Verifications are needed.

22 Trie (aka digital tree) (Fredkin, 1960)
Etymology: reTRIEval (pronounced like try, to distinguish it from tree). A trie housing the keys: an, ant, all, allot, alloy, aloe, are, ate, be
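A minimal trie sketch for exact key lookup (Python; dictionary-based children, our own naming):

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # char -> TrieNode
        self.is_key = False       # True if a stored key ends at this node

def trie_insert(root, key):
    node = root
    for c in key:
        node = node.children.setdefault(c, TrieNode())
    node.is_key = True

def trie_contains(root, key):
    node = root
    for c in key:
        node = node.children.get(c)
        if node is None:
            return False
    return node.is_key

# The keys from the slide
root = TrieNode()
for w in ["an", "ant", "all", "allot", "alloy", "aloe", "are", "ate", "be"]:
    trie_insert(root, w)
assert trie_contains(root, "alloy") and not trie_contains(root, "al")
```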

23 Trie design dilemma
Natural tradeoff between search time and space occupancy. If only pointers from the “existing” chars in a node are kept, it is more space-efficient, but the time spent in a node is O(log σ) (binary search within a node). Note: binary search is good in theory (for the worst case), but usually bad in practice (apart from the top trie levels / large alphabets?). The time per node can be improved to O(1) (a single lookup) if each node takes O(σ) space. In total, a pattern search takes either O(m log σ) or O(m) worst-case time.

24 Let’s trie to do it better...
In most cases tries require a lot of space. A widely-used improvement: path compression, i.e., combining every non-branching node with its child = Patricia trie (Morrison, 1968). Other ideas: clever encodings using only one bit per pointer, or one pointer for all the children of a node. PATRICIA stands for Practical Algorithm To Retrieve Information Coded in Alphanumeric.

25 Patricia trie, example

26 Rabin-Karp algorithm combined with binary search (Kytöjoki et al., 2003)
From the cited paper: Preprocessing: hash values for all patterns are calculated and stored in an ordered table. Matching can then be done by calculating the hash value for each m-char string of the text and searching the ordered table for this hash value using binary search. If a matching hash value is found, the corresponding pattern is compared with the text.

27 Rabin-Karp alg combined with binary search, cont’d (Kytöjoki et al., 2003)
Kytöjoki et al. implemented this method for m = 8, 16, and 32. The hash values for patterns of m = 8: a 32-bit int is formed from the first 4 bytes of the pattern and another from the last 4 bytes. These are then XORed together, resulting in the following hash function: Hash(x1 ... x8) = x1x2x3x4 ^ x5x6x7x8. The hash values for m = 16: Hash16(x1 ... x16) = (x1x2x3x4 ^ x5x6x7x8) ^ (x9x10x11x12 ^ x13x14x15x16). Hash32 analogously.
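A rough sketch of that scheme for m = 8 (Python; our own naming, and a dict standing in for the paper's sorted table + binary search):

```python
import struct

def hash8(s):
    """XOR of the first and last 4 bytes of an 8-byte string, read as 32-bit ints."""
    a, b = struct.unpack("<II", s[:8])
    return a ^ b

def multi_search_hashed(T, patterns):
    """Find occurrences of 8-byte patterns: hash every 8-byte window of T,
    look the value up among the pattern hashes, then verify candidates."""
    table = {}
    for P in patterns:
        table.setdefault(hash8(P), []).append(P)
    hits = []
    for i in range(len(T) - 8 + 1):
        window = T[i:i + 8]
        for P in table.get(hash8(window), []):   # verify on a hash match
            if window == P:
                hits.append((i, P))
    return hits

# multi_search_hashed(b"xxxxabcdefghxxxx", [b"abcdefgh"]) -> [(4, b"abcdefgh")]
```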

28 Aho–Corasick (AC) automaton (1975)
It’s a generalization of KMP. For small (in particular, constant) alphabets it is constructed in O(M) time and queries are answered in O(n + z) time (M – the sum of pattern lengths, z – the total number of occurrences of the patterns in T). More recently: Dori & Landau (2005) showed how to handle integer alphabets (σ = n^c, c > 0 a constant) with O(M) construction time and O(n log σ + z) query time. AC automaton: basically a trie for the patterns + failure links.

29 AC, example
A trie for the pattern set: { he, she, his, hers }. L(v) – label of node v, e.g., L(2) = she. An AC automaton for the same pattern set; dashed arrows are failure transitions (links).

30 AC, more formally (based on [http://www.cs.uku…])
3 functions: goto function: g(q, a) gives the state directly reached from the current state q by matching char a; in other words, if edge (q, v) is labeled by a, then g(q, a) = v. g(0, a) = 0 for each a that does not label an edge from the root (the automaton then stays in the initial state). Otherwise, g(q, a) = fail (and the failure function, below, is consulted).

31 AC, more formally, cont’d
failure function: f(q), for q ≠ 0, gives the state entered at a mismatch. More precisely, f(q) is the node labeled by the longest proper suffix w of L(q) such that w is a prefix of some pattern. output function: out(q) gives the set of patterns recognized when entering state q. Searching for patterns over T[1..n]:

32 Aho–Corasick alg, search complexity analysis. For each character of T, the automaton performs zero or more fail transitions, followed by one goto. Each goto either stays in the root (the g(0, a) = 0 step) or increases the state depth (= its distance from the root) by 1. Therefore, overall the state depth is increased at most n times. Each fail moves the current state closer to the root, hence the total number of fail transitions is also ≤ n. The z occurrences are reported trivially, in constant time each (typically outputting the matching pattern ID(s) and the start positions of the occurrences), i.e., O(z) time for that. Total search time: clearly O(n + z) (if the automaton is already built, which can be done in O(M) time).
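A compact sketch of the construction and search just analyzed (Python; BFS-built failure links, our own naming):

```python
from collections import deque

def build_ac(patterns):
    """Build the goto, fail and output structures; node 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for pid, P in enumerate(patterns):            # phase 1: the trie (goto function)
        q = 0
        for c in P:
            if c not in goto[q]:
                goto.append({}); fail.append(0); out.append([])
                goto[q][c] = len(goto) - 1
            q = goto[q][c]
        out[q].append(pid)
    queue = deque(goto[0].values())               # phase 2: failure links, in BFS order
    while queue:
        q = queue.popleft()
        for c, v in goto[q].items():
            queue.append(v)
            f = fail[q]
            while f and c not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(c, 0)
            out[v] += out[fail[v]]                # inherit outputs along the failure link
    return goto, fail, out

def ac_search(T, patterns):
    """Report (start position, pattern) for every occurrence of any pattern in T."""
    goto, fail, out = build_ac(patterns)
    q, hits = 0, []
    for i, c in enumerate(T):
        while q and c not in goto[q]:             # zero or more fail transitions
            q = fail[q]
        q = goto[q].get(c, 0)                     # one goto (possibly staying in the root)
        for pid in out[q]:
            hits.append((i - len(patterns[pid]) + 1, patterns[pid]))
    return hits

# ac_search("ushers", ["he", "she", "his", "hers"])
# -> [(1, 'she'), (2, 'he'), (2, 'hers')]
```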

33 Approximate string matching
Exact string matching problems are quite simple and almost closed in theory (new algorithms appear, but most of them are useful heuristics rather than new theoretical achievements). Approximate matching, on the other hand, is still a very active research area. Many practical notions of “approximateness” have been proposed, e.g., for tolerating typos in text, false notes in music scores, variations (mutations) of DNA sequences, music melodies transposed to another key, etc.

34 Edit distance (aka Levenshtein distance)
One of the most frequently used measures in string matching. edit(A, B) is the min number of elementary operations needed to convert A into B (or vice versa). The allowed basic operations are: insert a single char, delete a single char, substitute a char. Example: edit(pile, spine) = 2 (insert s; replace l with n).

35 Edit distance recurrence
We want to compute ed(A, B). The dynamic programming algorithm fills the matrix C[0..|A|, 0..|B|], where C[i, j] holds the min number of operations to convert A[1..i] into B[1..j]. The formulas are: C[i, 0] = i; C[0, j] = j; C[i, j] = C[i–1, j–1] if A[i] = B[j], and 1 + min(C[i–1, j], C[i, j–1], C[i–1, j–1]) otherwise.
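The recurrence above as a direct DP sketch (Python; O(|A|·|B|) time and space, our own naming):

```python
def edit_distance(A, B):
    """Levenshtein distance between A and B (insert / delete / substitute, unit costs)."""
    la, lb = len(A), len(B)
    C = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        C[i][0] = i                               # delete i chars of A
    for j in range(lb + 1):
        C[0][j] = j                               # insert j chars of B
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            if A[i - 1] == B[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j],        # deletion
                                  C[i][j - 1],        # insertion
                                  C[i - 1][j - 1])    # substitution
    return C[la][lb]

# edit_distance("pile", "spine") == 2, edit_distance("surgery", "survey") == 2
```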

36 Edit distance recurrence, rationale
(From Gonzalo Navarro’s PhD, 1998, p. 13, ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz)

37 DP for edit distance, example
A = surgery, B = survey (a widely used example, e.g. from Gonzalo Navarro’s PhD, 1998, ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz)

38 Local similarity Global measure: ed(A, B), or the search problem variant: ed(T[j’...j], P[1..m]). How to adapt the DP alg to search for a (short) pattern P in a (long) text T? Very simply. Each position in T may start a match, so we set C[0, j] = 0 for all j. Then we go column-wise (we calculate the columns C[·, j] one by one, for j = 1...n); see the sketch below.
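A sketch of that adaptation: report every text position j where the last cell of column j drops to at most k errors (Python; column-wise, O(m) space, our own naming):

```python
def approx_search(T, P, k):
    """Report 1-based end positions j where P matches a substring of T ending at j
    with edit distance <= k."""
    m = len(P)
    prev = list(range(m + 1))          # column for j = 0: C[i][0] = i
    ends = []
    for j, c in enumerate(T, 1):
        cur = [0] * (m + 1)            # C[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            if P[i - 1] == c:
                cur[i] = prev[i - 1]
            else:
                cur[i] = 1 + min(prev[i], cur[i - 1], prev[i - 1])
        if cur[m] <= k:                # an approximate occurrence ends at j
            ends.append(j)
        prev = cur
    return ends

# approx_search("surgery", "survey", 2) reports, among others, an occurrence ending at 7
```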

39 DP approach
Very flexible: e.g., you can associate positive weights (penalty costs) with each of the elementary error types (i.e., insertion, deletion, substitution), and then such a generalized edit distance calculation problem is solved after a trivial modification of the basic algorithm. The formula for this case (for the search problem variant), at text position j, is given below. But the complexity is O(mn), even in the best case. So algorithms have been found that are not always better in the worst case but are better on average.
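The omitted formula is presumably of the following standard weighted form (a hedged reconstruction, not copied from the slide; w_ins, w_del, w_sub denote the per-operation penalties):

```latex
C_{i,j} \;=\; \min\!\left\{
  \begin{array}{l}
    C_{i-1,j-1} + \delta(P_i, T_j),\\
    C_{i-1,j} + w_{\mathrm{del}},\\
    C_{i,j-1} + w_{\mathrm{ins}}
  \end{array}\right.
\qquad
\delta(a,b) = \begin{cases} 0 & \text{if } a = b,\\ w_{\mathrm{sub}} & \text{otherwise,} \end{cases}
```

with C_{0,j} = 0 for the search variant (a match may start at any text position).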

40 Partitioning lemma for the edit distance
We look for approximate occurrences of a pattern, with max allowed error k. Lemma (Rivest, 1976; Wu & Manber, 1992): If the pattern is split into k+1 (disjoint) pieces, then at least one piece must appear unaltered in an approximate occurrence of the pattern. More generally, if we split P into k+l parts, then at least l pieces must appear unaltered.

41 Partitioning lemma is a special case of the Dirichlet principle
The Dirichlet principle (aka pigeonhole principle) is a very obvious (but useful in math) general observation. Roughly, it says that if no pigeon is going to occupy a pigeonhole which already contains a pigeon, there is no way to fit n pigeons into fewer than n pigeonholes. Others prefer an example with rabbits. If you have 10 rabbits and 9 cages, (at least) one cage must hold (at least) two rabbits. Or (more appropriate for our partitioning lemma): 9 rabbits and 10 cages ⇒ (at least) one cage must be empty.

42 Dirichlet principle (if you want to be serious)
For any natural number n, there does not exist a bijection between a set S such that |S|=n and a proper subset of S.

43 Partitioning lemma in practice
Approx. string matching with max error k (edit distance): divide the pattern P into k+1 disjoint parts of length about m/(k+1); run any multiple exact string matching alg for those k+1 subpatterns; verify all matches (we need a tool for approximate matching anyway... it could be dynamic programming). A sketch follows below.
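A sketch of this filter, reusing the approx_search routine from the local-similarity slide for verification (Python; the pieces are matched exactly with str.find for brevity, where a real implementation would use a multi-pattern algorithm such as Aho–Corasick):

```python
def partition_filter_search(T, P, k):
    """Approximate search with <= k errors via the partitioning lemma:
    split P into k+1 pieces, find exact occurrences of the pieces, verify windows."""
    m = len(P)
    assert m >= k + 1
    piece_len = m // (k + 1)
    windows = set()
    for p in range(k + 1):
        start = p * piece_len
        end = start + piece_len if p < k else m      # last piece takes the remainder
        piece = P[start:end]
        pos = T.find(piece)                          # exact matching of one piece
        while pos != -1:
            # a full approximate occurrence lies within k chars of the implied alignment
            lo = max(0, pos - start - k)
            hi = min(len(T), pos - start + m + k)
            windows.add((lo, hi))
            pos = T.find(piece, pos + 1)
    ends = set()
    for lo, hi in windows:                           # verify with dynamic programming
        for j in approx_search(T[lo:hi], P, k):
            ends.add(lo + j)                         # 1-based end positions in T
    return sorted(ends)
```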

44 Indel distance
Very similar to the edit distance, but only INsertions and DELetions are allowed. Trivially, indel(A, B) ≥ edit(A, B). Both the edit() and indel() distance functions are metrics. That is, they satisfy the four conditions: non-negativity, identity of indiscernibles, symmetry and the triangle inequality.

45 Hamming distance Very simple (but with limited applications).
By analogy to the binary alphabet, dH(S1, S2) is the number of positions at which S1 and S2 differ. If |S1| ≠ |S2|, then dH(S1, S2) = ∞. Example: S1 = Donald Duck, S2 = Donald Tusk, dH(S1, S2) = 2.
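A one-line sketch (Python):

```python
def hamming(s1, s2):
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(s1) != len(s2):
        return float("inf")
    return sum(a != b for a, b in zip(s1, s2))

# hamming("Donald Duck", "Donald Tusk") == 2
```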

46 Longest Common Subsequence (LCS)
Given strings A, B, |A| = n, |B| = m, find the longest subsequence shared by both strings. More precisely, find 1 ≤ i1 < i2 < ... < ik–1 < ik ≤ n and 1 ≤ j1 < j2 < ... < jk–1 < jk ≤ m such that A[i1] = B[j1], A[i2] = B[j2], ..., A[ik] = B[jk] and k is maximized. k is the length of the LCS(A, B), also denoted LLCS(A, B). Sometimes we are interested in a simpler problem: finding only the LLCS, not the matching sequence itself.

47 LCS dynamic programming formula
LCS applications: the diff utility (e.g., comparing two different versions of a file, or two versions of a large programming project); molecular biology (biologists find a new sequence – what other sequences is it most similar to?); finding the longest ascending subsequence of a permutation of the integers 1..n; longest common increasing subsequence. LCS dynamic programming formula: see the sketch below.
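The recurrence (presumably an image on the original slide) is the standard one; a sketch of the table computation (Python, our own naming):

```python
def lcs_length(A, B):
    """LLCS(A, B) via dynamic programming:
       L[i][j] = L[i-1][j-1] + 1            if A[i] == B[j]
               = max(L[i-1][j], L[i][j-1])  otherwise."""
    n, m = len(A), len(B)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

# lcs_length("surgery", "survey") == 5   (an LCS is "surey")
```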

48 LCS length calculation via dynamic programming

49 LCS, anything better than plain DP?
The basic dyn. programming is clearly O(mn) in the worst case. Surprisingly, we cannot beat this result significantly in the worst case. The best practical idea for the worst case is a bit-parallel algorithm (there are a few variants) with O(n ⌈m/w⌉) time, where w is the machine word size (and a few times faster than the plain DP in practice). Still, we also have algorithms with output-dependent complexities, e.g., the Hunt–Szymanski (1977) one with O(r log m) worst case time, where r is the number of matching cells in the DP matrix (that is, r is mn in the worst case).

50 Text indexing
If many searches are expected to be run over a text (e.g., a manual, a collection of journal papers), it is worth sacrificing space and preprocessing time to build an index over the text that supports fast searches. A full-text index: a match at any position of T can be found through it. Not all text indexes are full-text ones. For example, word-based indexes will find P’s occurrences in T only at word boundaries. (Quite enough in many cases, less space consuming, and often more flexible in some ways.)

51 Suffix tree (Weiner, 1973). The Lord of the Strings
Suffix tree ST(T) is basically a Patricia trie containing all n suffixes of T. Space: O(n log n) bits (but with a large constant). Construction time: O(n log σ). Search time: O(m log σ + occ) (occ – the number of occurrences of P in T).

52 Suffix tree example

53 ST, larger example

54 Suffix tree, pros and cons
+ excellent search complexity, + good search speed in practice, + some advanced queries can be handled easily with the ST too; – lots of space: about 20n bytes in the worst case even in the best implementations (about 10n on avg in the Kurtz implementation), – construction algorithms quite complicated.

55 Suffix array (Manber & Myers, 1990)
A surprisingly simple (yet efficient) idea. Sort all the suffixes of T and store their indexes in a plain array (n indexes, each typically 4 bytes). Keep T as well (total space occupancy: 4n + 1n = 5n bytes, much less than with a suffix tree). Search for P: compare P against the median suffix (that is: read the median suffix index, then refer to the original T). If not found, go left or right, depending on the comparison result, each time halving the range of suffixes. So, this is binary search based.
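A minimal sketch of construction and search (Python; a naive comparison sort for clarity – real SA construction is O(n log n) or even O(n) – and our own naming):

```python
def build_sa(T):
    """Suffix array: starting positions of T's suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def sa_search(T, sa, P):
    """All occurrences of P in T = the contiguous SA range of suffixes starting with P."""
    n, m = len(T), len(P)
    lo, hi = 0, n                              # leftmost suffix whose m-char prefix >= P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[sa[mid]:sa[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, n                           # end of the run of suffixes starting with P
    while lo < hi:
        mid = (lo + hi) // 2
        if T[sa[mid]:sa[mid] + m] == P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])

T = "abracadabra"
sa = build_sa(T)      # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2] -- the slide's SA, 0-based
print(sa_search(T, sa, "abra"))                # [0, 7]
```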

56 SA example T = abracadabra
We could have a $ terminator after T, actually...

57 SA example, cont’d Now, sort the suffixes lexicographically
SA(T) = {11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3}

58 SA search properties The search basic mechanism is that each pattern occurring in text T must be some prefix of a suffix of T. Worst case search time: O(m log n + occ). But in practice it is closer to O(m + log n + occ). Could be sped up somewhat by maintaining a lookup table (LUT) to narrow down the initial bin-search interval (4k bytes of extra space if k-symbol ranges used) – a trick from the original M & M paper. SA: very simple and practical. 58

59 Correspondence between leaf nodes in the ST and elements in the SA
From: M. Kasahara and S. Morishita, Large-Scale Genome Sequence Processing, 2006.

60 SA-hash (Grabowski & Raniszewski, 2014)
A LUT over k symbols requires σ^k integers with the original Manber–Myers idea, which limits it to small k. But we can use a hash table (HT) for the prefixes of the existing suffixes only (and thus use a larger k). All datasets of 200 MB (1 M = 1024²).

61 SA-hash, timings. Test machine: i7 4930K 3.4 GHz, 64 GB RAM; Ubuntu (64-bit), C++, 64-bit gcc (-O3). (The timing results were given in a figure, not included in this transcript.)

62 SA-hash, speedups. HT load factor 90%, hash function xxhash.

63 SA-hash, space use

