Aho-Corasick Algorithm: generalizes KMP to handle sets of strings. New ideas: keyword trees, failure functions/links, output links.


1 Aho-Corasick Algorithm
Generalizes KMP to handle sets of strings
New ideas
–keyword trees
–failure functions/links
–output links

2 Problem Definition
Input
–P, a set of z patterns {P_1, …, P_z} (total length n)
–text T, length m
Task
–Output the location of all occurrences of each pattern P_i in T
Bounds
–O(n+zm) bound using exact string matching algorithms, run once per pattern
–Goal: O(n+m+k) bound, where k is the total number of occurrences of patterns from P in T

3 Keyword Tree
P = {poet, pope, popo, too}
[Figure: keyword tree for P; the nodes ending a pattern are numbered 1 (poet), 2 (pope), 3 (popo), 4 (too)]
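A keyword tree like the one on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `build_keyword_tree` and the list-of-dicts representation are assumptions, not something the slides prescribe. Node 0 is the root, and `number[v]` records which pattern (1-based, as in the figure) ends at node v.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree (trie) stored as parallel lists indexed by node id.

    children[v] maps an edge character to the child node id.
    number[v] is the 1-based number of the pattern ending at v, or None.
    Assumes the patterns are distinct and non-empty.
    """
    children = [{}]   # node 0 is the root
    number = [None]
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                # Create a new node for an unseen edge label.
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i   # pattern i ends at node v
    return children, number


children, number = build_keyword_tree(["poet", "pope", "popo", "too"])
print(len(children), "nodes, root included")
```

Each character of each pattern is visited once, which matches the O(n) construction bound when dictionary lookups are treated as constant time.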

4 Observations
Keyword tree K construction
–Can be done in O(n) time (remember, n is the total length of all patterns)
Naïve search algorithm with keyword tree K
–Align the tree to each position in T and see if there is a match
–O(nm) time
Use KMP ideas to speed this up
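The naïve search described above can be sketched as follows. The names `build_keyword_tree` and `naive_search` are illustrative assumptions, and positions are 0-based here for Python convenience. The point of the sketch is that every start position restarts from the root, which is where the O(nm) bound comes from.

```python
def build_keyword_tree(patterns):
    """Keyword tree as parallel lists; node 0 is the root."""
    children, number = [{}], [None]
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i
    return children, number


def naive_search(patterns, text):
    """Return (start, pattern_number) pairs, with 0-based start positions."""
    children, number = build_keyword_tree(patterns)
    hits = []
    for start in range(len(text)):
        v = 0   # re-align the root of the tree at every start position
        for pos in range(start, len(text)):
            v = children[v].get(text[pos])
            if v is None:          # no edge for this character: mismatch
                break
            if number[v] is not None:
                hits.append((start, number[v]))
    return hits


print(naive_search(["poet", "pope", "popo", "too"], "popoet"))
```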

5 Failure functions
Temporary assumption
–No pattern in P is a proper substring of another pattern in P
Definitions
–For each node v of K, L(v) denotes the concatenation of the characters on the path from the root to node v
–For any node v of K, define lp(v) to be the length of the longest proper suffix of L(v) that is a prefix of some pattern in P
–For a node v of K, let f(v) denote the unique node in K labeled by the suffix of L(v) of length lp(v)
  Note: f(v) = the root of K if lp(v) = 0
–Directed edge (v, f(v)) is a failure link
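To make the definition of lp(v) concrete, here is a brute-force sketch that computes it directly from the definition, given the label L(v) as a string. The function name `lp` simply mirrors the slide's notation; the quadratic approach is an assumption made for clarity and is not how the linear-time algorithm works.

```python
def lp(label, patterns):
    """Length of the longest proper suffix of `label` that is a prefix
    of some pattern, computed by trying suffixes from longest to shortest."""
    for length in range(len(label) - 1, 0, -1):
        suffix = label[-length:]
        if any(p.startswith(suffix) for p in patterns):
            return length
    return 0   # only the empty suffix qualifies, so f(v) is the root


P = ["poet", "pope", "popo", "too"]
for label in ["pop", "popo", "poet", "pope", "too"]:
    print(label, lp(label, P))
```

For example, for the node labeled popo the suffix po is a prefix of every p-pattern, so lp is 2 and f(v) is the node labeled po.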

6 Keyword Tree and failure links
P = {poet, pope, popo, too}
[Figure: keyword tree for P with failure links; the nodes ending a pattern are numbered 1 (poet), 2 (pope), 3 (popo), 4 (too)]

7 Using failure links in search
Setting: matched up to node v in K, ending at T(c-1) in T
–T(c) does not occur on any edge out of v
Update
–“Shift” the alignment so that it starts at position c - lp(v) of T
  This lines up T with the maximal prefix of some pattern in P that is guaranteed by the definition of lp(v)
–v = f(v)
–Next comparison will still be with T(c) against the edges out of the new node v
–Full details on page 56

8 Recursive structure for computing failure links
Base case
–v is the root or a direct child of the root: f(v) = root
Recursive case
–Compute f(v) for v that is k+1 steps from the root, assuming f(w) has been computed for all w that are <= k steps from the root
Observation
–L(v) = L(parent(v)) concatenated with x, where x is the character labeling edge (parent(v), v)
–Thus, f(parent(v)) can help

9 Computing failure links
Def: x is the character on edge (parent(v), v); r is the root
Algorithm for node v
  w = f(parent(v));  /* using information about parent to help */
  while (there is no edge out of w labeled x) and (w is not equal to r)
    w = f(w);
  if there is an edge (w, w') out of w labeled x
    f(v) = w'
  else
    f(v) = r
Do this in a breadth-first manner through the tree
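The breadth-first computation above might look like this in Python. It is a sketch under stated assumptions: the names `build_keyword_tree`, `compute_failure_links` and `node_of` are hypothetical, nodes are integer ids with 0 as the root r, and the loop visits each parent and assigns f to its children, which is equivalent to running the slide's per-node algorithm in breadth-first order.

```python
from collections import deque


def build_keyword_tree(patterns):
    """Keyword tree as a list of {char: child_id} dicts; node 0 is the root."""
    children = [{}]
    for pattern in patterns:
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                children[v][ch] = len(children) - 1
            v = children[v][ch]
    return children


def compute_failure_links(children):
    """Return f, where f[v] is the failure link of node v."""
    f = [0] * len(children)               # base case: root and its children
    queue = deque(children[0].values())   # start from the depth-1 nodes
    while queue:
        parent = queue.popleft()
        for x, v in children[parent].items():
            w = f[parent]                 # use the parent's failure link
            while x not in children[w] and w != 0:
                w = f[w]
            # If w has an edge labeled x, follow it; otherwise f(v) = root.
            f[v] = children[w].get(x, 0)
            queue.append(v)
    return f


def node_of(children, label):
    """Helper for the demo: id of the node whose path label is `label`."""
    v = 0
    for ch in label:
        v = children[v][ch]
    return v


children = build_keyword_tree(["poet", "pope", "popo", "too"])
f = compute_failure_links(children)
print(f[node_of(children, "popo")] == node_of(children, "po"))
```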

10 Keyword Tree and failure links
P = {poet, pope, popo, too}
[Figure: keyword tree for P with failure links; the nodes ending a pattern are numbered 1 (poet), 2 (pope), 3 (popo), 4 (too)]

11 Linear time argument
Consider a single pattern p of length t
–Let p also denote the path of p in K
Time to compute failure links for all nodes on p is O(t)
–For any v on p, lp(v) <= lp(parent(v)) + 1
  Therefore, max lp(v) is t
–The maximum number of decrements of lp(w), and thus the maximum number of assignments to w inside the while loop, over all nodes on path p is t (the assignment w = f(w) on the previous slide)
  Each assignment of w in the while loop decreases lp(w) by at least one
  lp(w) is never negative along the whole path p
–Total number of assignments is O(t)

12 Allowing substrings
Remove assumption
–No pattern in P is a proper substring of another pattern in P
Definitions
–The output link (if there is one) at node v points to the numbered node, other than v itself, that is reachable from v by following the fewest failure links
–Adding output link computation to the algorithm for f(v)
  If f(v) is a numbered node, then output(v) = f(v)
  else if output(f(v)) is defined, then output(v) = output(f(v))
  else output(v) is undefined
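Putting the pieces together, the following sketch adds the output-link rule above to the failure-link computation and uses both in a search. The names `build_automaton` and `search` are illustrative assumptions, as are the 0-based positions and the requirement that patterns are distinct and non-empty. It is one possible arrangement of the ideas on these slides, not a reference implementation.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links f and output links (None = undefined)."""
    children, number = [{}], [None]
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i
    f = [0] * len(children)
    output = [None] * len(children)
    queue = deque(children[0].values())
    while queue:
        parent = queue.popleft()
        for x, v in children[parent].items():
            w = f[parent]
            while x not in children[w] and w != 0:
                w = f[w]
            f[v] = children[w].get(x, 0)
            # Output-link rule: f(v) itself if numbered, else inherit from f(v).
            output[v] = f[v] if number[f[v]] is not None else output[f[v]]
            queue.append(v)
    return children, number, f, output


def search(patterns, text):
    """Return (start, pattern) for every occurrence, with 0-based starts."""
    children, number, f, output = build_automaton(patterns)
    hits, v = [], 0
    for c, ch in enumerate(text):
        while ch not in children[v] and v != 0:
            v = f[v]                      # follow failure links on mismatch
        v = children[v].get(ch, 0)
        # Report the pattern ending at v (if any), then follow output links.
        u = v if number[v] is not None else output[v]
        while u is not None:
            pattern = patterns[number[u] - 1]
            hits.append((c - len(pattern) + 1, pattern))
            u = output[u]
    return hits


print(search(["at", "pot", "potato", "tatter"], "potatter"))
```

Following output links rather than all failure links at each text position is what keeps the reporting cost proportional to the number of occurrences k.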

13 Keyword Tree and output links
P = {at, pot, potato, tatter}
[Figure: keyword tree for P with output links; the nodes ending a pattern are numbered 1 to 4]

