# Laboratory of Information Knowledge Network,

## Presentation on theme: "Laboratory of Information Knowledge Network,"— Presentation transcript:

Lecture on Information Knowledge Network "Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University Takuya KIDA 2011/1/7 Lecture on Information knowledge network

The 5th Regular expression matching
About regular expression Flow of processing Construction of syntax tree (parse tree) Construction of NFA for RE Simulating the NFA 2011/1/7 Lecture on Information knowledge network

What is regular expression?
Notation for flexible and strong pattern matching Example of a regular expression of filenames: > rm ＊.txt > cp Important[0-9].doc Example of a regular expression of search tool Grep： > grep –E “for.+(256|CHAR_SIZE)” ＊.c Example of a regular expression of programming language Perl： \$line = m|^http://.+\.jp/.+\$| A regular expression can express a regular set (regular language). It expresses a language L (sets of strings) which can be accepted by a finite automaton matches to any files whose extensions are “.txt. “ matches to Important0.doc～Important9.doc matches to strings which start with “http://” and include ”.jp/”. 2011/1/7 Lecture on Information knowledge network

Definition of regular expression
Definition: A regular expression is a string over Σ∪{ε, |, ･, *, (, )}, which is recursively defined by the following rules. (1) An element of {ε}∪Σ is a regular expressions. (2) If α and β are regular expressions, then (α･β) is a regular expression. (3) If α and β are regular expressions, then (α|β) is a regular expression. (4) If α is a regular expression, α* is a regular expression. (5) Only ones led on from the above are regular expressions. Example: (A･((A･T)|(C･G))*)　　→ A(AT|CG)* (α･β) is often described αβ for short ※ Symbols ‘| ’, ‘・’, ‘*’ are called operator Moreover, for a regular expression α, “+" is often used in the meaning of α+ =α・α*. 2011/1/7 Lecture on Information knowledge network

Semantic of regular expression
A regular expression is mapped into a subset of Σ* (language L) (i) ||ε|| = {ε} (ii) For a∈Σ, || a || = { a } (iii) For regular expressions α and β, ||(α・β)|| = ||α||・||β|| (iv) For regular expressions α and β, ||(α|β)|| = ||α||∪||β|| (v) For a regular expression α, ||α*|| = ||α||* For example： (a・(a | b)*) || (a・(a | b) *) || = ||a||・||(a | b)*|| = {a}・||(a | b)||* = {a}・({a}∪{b})* = { ax | x∈{a, b}* } a q1 q0 q2 b a,b An equivalent DFA to the left example ※exercise:　What is the equivalent language to (AT|GA)(TT)* ? 2011/1/7 Lecture on Information knowledge network

What is the regular expression matching problem?
It is the problem to find any strings in L(α)＝||α||, which is defined by a given α, from a given text. The ability of regular expression to define a language is equal to that of finite automaton! We can construct a finite automaton that accepts the same language expressed by a regular expression. We also can describe a regular expression that expresses the same language accepted by a finite automaton. ※ Please refer to "Automaton and computability" (2.5 regular expressions and regular sets), written by Setsuo Arikawa and Satoru Miyano. What we should do for matching a regular expression is to make an automaton (NFA/DFA) corresponding to the regular expression and then to simulate it. A regular expression is easier to convert to NFA than to DFA. The initialization state of the automaton is always active. The pattern expressed by a given regular expression occurs when the automaton reaches to the final states by reading a text. 2011/1/7 Lecture on Information knowledge network

Flow of pattern matching process
Constructing NFA by Thompson method General flow Parsing Scan texts Regular expression Parse tree NFA Report the occurrences Translate Scan texts Constructing NFA by Glushkov method DFA Flow with filtering technique Multiple pattern matching Extracting Verify Regular expression A set of strings Find the candidates Report the occurrences 2011/1/7 Lecture on Information knowledge network

Construction of parse tree
Parse tree:　a tree structure used in preparation for making NFA Each leaf node is labeled by a symbol a∈Σ or the empty word ε. Each internal node is labeled by a operator symbol on {|, ・, *}. Although a parser tool like Lex and Flex can parse regular expressions, it is too exaggerated. (The pseudo code of the next slide is enough to do that). Example: the parse tree TRE for regular expression RE=(AT|GA)((AG|AAA)*) | A T G * ( A T | G A ) ( ( A G | A A A ) * ) | * Depth of parentheses Operator | A T G A 1 1 | A G A 2 | A A 2011/1/7 Lecture on Information knowledge network

Pseudo code Parse (p=p1p2…pm, last) v ← θ; while plast≠\$ do
if plast∈Σ or plast=ε then /* normal character */ vr ← Create a node with plast; if v≠θthen v ← [・](v, vr); else v ← vr; last ← last + 1; else if plast = ‘|’ then /* union operator */ (vr, last) ← Parse(p, last + 1); v ← [ | ](v, vr); else if plast = ‘*’ then /* star operator */ v ← [ * ](v); else if plast = ‘( ’ then /* open parenthesis */ else if plast = ‘)’ then /* close parenthesis */ return (v, last); end of if end of while 2011/1/7 Lecture on Information knowledge network

NFA construction by Thompson method
K. Thompson. Regular expression search algorithm. Communications of the ACM, 11: , 1968. Idea: Traversing the parse tree TRE for a given RE in post-order traversal, we construct the automaton Th(v) that accepts language L(REv) corresponding to a partial tree whose top is node v. The key point is that Th(v) can be obtained by connecting with ε transitions the automatons corresponding to each partial tree whose top is a child of v. Properties of Thompson NFA The number of states < 2m, and the number of state transitions < 4m →O(m). It contains many ε transitions. The transitions other than ε connect the states from i to i+1. Example：　Thompson NFA for RE = (AT|GA)((AG|AAA)*) A 1 2 17 3 4 5 6 T 7 8 9 10 11 12 13 14 15 16 G ε 2011/1/7 Lecture on Information knowledge network

NFA construction algorithm
For the parse tree tree TRE, as traversing the tree in post-order traversal, it generates and connects the automatons for each node as follows. (i) When v is the empty word ε (iv) When v is a selection ”|” → (vL| vR) ε IL vL I F FL ε ε I F (ii) When v is a character “a” ε ε IR vR FR a I F (v) When v is a repetition”*” → v* (iii) When v is a concatenation ”・” → (vL・vR) ε v ε ε IL vL vR FR I F ε 2011/1/7 Lecture on Information knowledge network

Move of the NFA construction algorithm
Ex.： Parse tree TRE for RE=(AT|GA)((AG|AAA)*) 18 | A T G * 7 17 3 6 16 15 10 1 2 4 5 13 14 8 9 11 12 Ex.： Thompson NFA for RE = (AT|GA)((AG|AAA)*) A 1 2 17 3 4 5 6 T 7 8 9 10 11 12 13 14 15 16 G ε ε 9 10 A 11 G 8 ε 16 ε A 1 2 3 T ε 7 ε 17 ε 12 13 A 14 A 15 A 5 4 G 6 A 2011/1/7 Lecture on Information knowledge network

Pseudo code Thompson_recur (v)
if v = “|”(vL, vR) or v = “・”(vL, vR) then Th(vL) ← Thompson_recur(vL); Th(vR) ← Thompson_recur(vR); else if v=“*”(vC) then Th(v) ← Thompson_recur(vC); /* the above is for recursive traversal (post-order) */ if v=(ε) then return construction (i); if v=(α), α∈Σ then return construction (ii); if v=“・”(vL, vR) then return construction (iii); if v=“|”(vL, vR) then return construction (iv); if v=“*”(vC) then return construction (v); Thompon(RE) vRE ← Parse(RE\$, 1); /* construct the parse tree */ Th(vRE) ← Thompson_recur(vRE); 2011/1/7 Lecture on Information knowledge network

NFA construction by Glushkov method
V-M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53, 1961. Idea Making a new expression RE’ by numbering each symbol a∈∑ sequentially from the beginning to the end. (Let ∑’ be the alphabet with subscripts) Example: RE = (AT|GA)((AG|AAA)*) → RE’ = (A1T2|G3A4)((A5G6|A7A8A9)*) After constructing a NFA that accepts language L(RE'), we obtain the final NFA by removing the subscript numbers. Properties of Glushkov NFA The number of states is just m+1, and the number of state transitions is O(m2). It doesn't contain any ε transitions. For any node, all the labels of transitions entering to the node are the same. Example：　A NFA for RE’ = (A1T2|G3A4)((A5G6|A7A8A9)*) A1 1 2 T2 G3 3 4 A4 A5 5 6 G6 A7 7 8 A8 9 A9 Example：　The Glushkov NFA A 1 2 T G 3 4 5 6 7 8 9 2011/1/7 Lecture on Information knowledge network

NFA construction algorithm (1)
Construction procedure: Making a new expression RE’ by numbering each symbol a∈∑ sequentially from the beginning to the end. Pos(RE’) = {1…m}, and ∑’ is the alphabet with subscript numbers. As traversing the parse tree TRE’ in post-order traversal, for each language REv’ corresponding to a partial tree whose top is v, it calculates set First(REv’), set Last(REv’), function Emptyv, and function Follow(RE', x) of position x as follows. First(RE’) = {x∈Pos(RE’) | ∃u∈∑’*, αxu∈L(RE’)} Last(RE’) = {x∈Pos(RE’) | ∃u∈∑’*, uαx∈L(RE’)} Follow(RE’, x) = {y∈Pos(RE’) | ∃u, v∈∑’*, uαxαyv∈L(RE’)} EmptyRE: a function that returns {ε} if ε belongs to L(RE), otherwise returns φ. This can be recursively calculated as follows. Emptyε = {ε}, Emptyα∈∑ = φ, EmptyRE1|RE2 = EmptyRE1 ∪ EmptyRE2, EmptyRE1・RE2 = EmptyRE1 ∩ EmptyRE2, EmptyRE* = {ε}. The NFA is constructed based on the values obtained from the above. Positions of the initial states Positions of the final states transition functions Whether the initial state of the NFA is a final state or not? 2011/1/7 Lecture on Information knowledge network

NFA construction algorithm (2)
The Glushkov NFA GL’= (S, ∑’, I, F, δ’) that accepts language L(RE') S :A set of states. S = {0, 1, …, m} ∑' :The alphabet with subscript numbers I :The initial state id; I = 0 F :The final states; F = Last(RE’)∪(EmptyRE・{0}). δ' :Transition function defined by the followings ∀x∈ Pos(RE’), ∀y∈ Follow(RE’, x), δ’(x, αy) = y The transitions from the initial state are as follows: ∀y∈ First(RE’), δ’(0, αy) = y Example： NFA for RE’ = (A1T2|G3A4)((A5G6|A7A8A9)*) A1 1 2 T2 G3 3 4 A4 A5 5 6 G6 A7 7 8 A8 9 A9 2011/1/7 Lecture on Information knowledge network

Pseudo code Glushkov_variables (vRE, lpos)
if v = [ | ](vl,vr) or v = [・](vl,vr) then lpos ← Glushkov_variables(vl, lpos); lpos ← Glushkov_variables(vr, lpos); else if v = [ * ](v*) then lpos ← Glushkov_variables(v*, lpos); end of if if v = (ε) then First(v) ← φ, Last(v) ← φ, Emptyv ← {ε}; else if v = (a), a∈Σ then lpos ← lpos + 1; First(v) ← {lpos}, Last(v) ← {lpos}, Emptyv ← φ, Follow(lpos) ← φ; else if v = [ | ](vl,vr) then First(v) ← First(vl)∪First(vr); Last(v) ← Last(vl)∪Last(vr); Emptyv ← Emptyvl∪Emptyvr; else if v = [・](vl,vr) then First(v) ← First(vl)∪(Emptyvl・First(vr)); Last(v) ← (Emptyvr・Last(vl))∪Last(vr); Emptyv ← Emptyvl∩Emptyvr; for x∈Last(vl) do Follow(x) ← Follow(x)∪First(vr); else if v = [ * ](v*) then First(v) ← First(v*), Last(v) ← Last(v*), Emptyv ← {ε}; for x∈Last(v*) do Follow(x) ← Follow(x)∪First(v*); return lpos; O(m3) time totally It takes O(m2) time 2011/1/7 Lecture on Information knowledge network

Pseudo code (cont.) Glushkov (RE)
/* make the parse tree by parsing the regular expression */ vRE ← Parse(RE\$, 1); /* calculate each variable by using the parse tree */ m ← Glushkov_variables(vRE, 0); /* construct NFA GL(S,∑, I, F,δ) by the variables */ Δ←φ; for i ∈ 0…m do create state I; for x ∈ First(vRE) do Δ←Δ∪ {(0, αx, x)}; for i ∈ 0…m do for i ∈ Follow(i) do Δ←Δ∪ {(i,αx, x)}; end of for for x∈ Last(vRE)∪(EmptyvRE・{0}) do mark x as terminal; 2011/1/7 Lecture on Information knowledge network

Taiwan High-speed Railway@Taipei 2011.11.8
Take a breath Taiwan High-speed 2011/1/7 Lecture on Information knowledge network

Flow of pattern matching process
An NFA can be simulated in O(mn) time Constructing NFA by Thompson method Parsing Scan texts Regular expression Parse tree NFA Report the occurrences Translate Scan texts Constructing NFA by Glushkov method To translate, we need O(2m) time and space DFA There exists a method of converting directly into a DFA ※Please refer the section 3.9 of “Compilers – Principles, Techniques and Tools,” written by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986. 2011/1/7 Lecture on Information knowledge network

Methods of simulating NFAs
Simulating a Thompson NFA directly The most naïve method Storing the current active states with a list of size O(m), the method updates the states of the NFA in O(m) time for each symbol read from a text. It obviously takes O(mn) time. Simulating a Thompson NFA by converting into an equivalent DFA It is a classical technique. Refer “Compilers – Principles, Techniques and Tools,” written by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986. The conversion is done as preprocessing → it takes O(2m) time and space. There are also techniques that converses dynamically as scanning a text. Hybrid method E. W. Myers. A four russians algorithm for regular expression pattern matching. Journal of the ACM, 39(2): , 1992. It is a method that combines NFA and DFA to do efficient matching. It divides the Thompson NFA into modules which include O(k) nodes for each, and then converses each module into DFA. It simulates the transitions between modules as a NFA. High-speed NFA simulation by bit-parallel technique Simulating the Thompson NFA: proposed by Wu and U. Manber[1992] Simulating the Glushkov NFA: proposed by G. Navarro and M. Raffinot[1999] 2011/1/7 Lecture on Information knowledge network

Simulating by converting into an equivalent DFA
Ex.： A DFA converted from the Glushkov NFA for RE = (AT|GA)((AG|AAA)*) 0 1 A 0 2 0 3 0 4 0 3 6 0 1 8 0 1 9 C,T C G T C,T DFA Classical (N = (Q,∑, I, F,Δ), T = t1t2…tn) Preprocessing: for σ∈∑ do Δ←Δ∪ (i, σ, I); (Qd,∑, Id, Fd,δ) ← BuildDFA(N); /* Make an equivalent DFA with NFA N */ Searching: s ← Id; for pos ∈ 1…n do if s∈Fd then report an occurrence ending at pos – 1; s ← δ(s, tpos); end of for 2011/1/7 Lecture on Information knowledge network

Bit-parallel Thompson (BPThompson)
S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992. Simulating the Thompson NFA by bit-parallel technique For Thompson NFAs, note that the next of the i-th state is the i+1th except for ε transitions. → Bit-parallelism similar to the Shift-and method can be applicable. ε transitions are separately simulated. This needs the mask table of size 2L (L is the number of states of the NFA) It takes O(2L + m|∑|) time for preprocessing. It scans a text in O(n) time when L is small enough. About NFA GL=(Q={s0,…,s|Q|-1}, ∑, I = s0, F, Δ) The expression of mask tables of the NFA: Qn={0,…,|Q-1|}, In = 0|Q|-11, Fn = |sj∈F 0|Q|-1-j10j Definitions of mask tables: Bn[i,σ] = |(si,σ,sj)∈Δ 0|Q|-1-j10j En[ i ] = |sj∈E(i) 0|Q|-1-j10j (where E(i) is the ε-closure of state si) Ed[D] = |i, i=0 OR D&0L-i-110i ≠ 0L En[ i ] B[σ] = |i∈0…m Bn[i, σ] 2011/1/7 Lecture on Information knowledge network

Pseudo code BuildEps (N = (Qn,∑,In,Fn,Bn,En) ) for σ∈∑ do B[σ] ← 0L;
for i∈0…L–1 do B[σ] ← B[σ] | Bn[i,σ]; end of for Ed[0] ← En[0]; for i∈0…L–1 do for j∈0…2i – 1 do Ed[2i + j] ← En[ i ] | Ed[ j ]; return (B, Ed); BPThompson (N = (Qn,∑,In,Fn,Bn,En), T = t1t2…tn) Preprocessing: (B, Ed) ← BuildEps(N); Searching: D ← Ed[ In ]; /* initial state */ for pos∈1…n do if D & Fn≠ 0L then report an occurrence ending at pos–1; D ← Ed[ (D << 1) & B[tpos] ]; end of for 2011/1/7 Lecture on Information knowledge network

Bit-parallel Glushkov (BPGlushkov)
G. Navarro and M. Raffinot. Fast regular expression search. In Proc. of WAE99, LNCS1668, , 1999. Simulating the Glushkov NFA by bit-parallel technique For Glushkov NFAs, note that, for any node, all the labels of transitions entering to the node are the same. → Although the bit-parallel similar to the Shift-And method cannot be applicable, each state transition can be calculated by Td[D]&B[σ]. The number of mask tables is 2|Q| (while it is 2L for BPThompson). It takes O(2m + m|∑|) time for preprocessing. It scans a text in O(n) time when m is small enough. It is more efficient than BPThompson in almost all cases. About NFA GL=(Q={s0,…,s|Q|-1}, ∑, I = s0, F, Δ) The expression of mask tables of the NFA: Qn={0,…,|Q-1|}, In = 0|Q|-11, Fn = |sj∈F 0|Q|-1-j10j Definitions of mask tables: Bn[i,σ] = |(si,σ,sj)∈Δ 0|Q|-1-j10j B[σ] = |i∈0…m Bn[i, σ] Td[D] = |(i,σ), D&0m-i10i ≠ 0m+1, σ∈∑ Bn[i,σ] 2011/1/7 Lecture on Information knowledge network

Pseudo code BuildTran (N = (Qn,∑,In,Fn,Bn,En) )
for i∈0…m do A[ i ] ← 0m+1; for σ∈∑ do B[σ] ← 0m+1; for i∈0…m, σ∈∑ do A[ i ] ← A[ i ] | Bn[I,σ]; B[σ] ← B[σ] | Bn[i,σ]; end of for Td[0] ← 0m+1; for i∈0…m do for j∈0…2i – 1 do Td[2i + j] ← A[ i ] | Td[ j ]; return (B, Ed); BPGlushkov (N = (Qn,∑,In,Fn,Bn,En), T = t1t2…tn) Preprocessing: for σ∈∑ do Bn[0,σ] ← Bn[0,σ] | 0m1; /* initial self-loop */ (B, Ed) ← BuildTran(N); Searching: D ← 0m1; /* initial state */ for pos∈1…n do if D & Fn≠ 0m+1 then report an occurrence ending at pos–1; D ← Td[D] & B[tpos]; end of for 2011/1/7 Lecture on Information knowledge network

Other topics Extended regular expression:
The one with allowing two operations, intersection and complementation, in addition to connection, selection, and repetition. ￢(UNIX)∧(UNI(.)* | (.)*NIX) It is different from POSIX regular expression. H. Yamamoto, An Automata-based Recognition Algorithm for Semi-extended Regular Expressions, Proc. MFCS2000, LNCS1893, , 2000. O. Kupferman and S. Zuhovitzky, An Improved Algorithm for the Membership Problem for Extended Regular Expressions, Proc. MFCS2002, LNCS2420, , 2002. Researches on speeding-up regular expression matching Filtration technique using BNDM + verification G. Navarro and M. Raffinot, New Techniques for Regular Expression Searching, Algorithmica, 41(2): , 2004. - In this paper, the method of simulating the Glushkov NFA with mask tables of O(m2m) bits is also presented. 2011/1/7 Lecture on Information knowledge network

The 5th summary Regular expression Flow of regular expression matching
the ability of it to define the language is the same as that of finite automaton. Flow of regular expression matching After translating it to a parse tree, the corresponding NFA is constructed. Matching is done by simulating the NFA Filtration + pattern plurals collation + inspection + NFA simulation Methods for constructing an NFA Thompson NFA: The number of states < 2m, and the number of state transitions < 4m →O(m). It contains many ε transitions. The transitions other than ε connect the states from i to i+1. Glushkov NFA The number of states is just m+1, and the number of state transitions is O(m2). It doesn't contain any ε transitions. For any node, all the labels of transitions entering to the node are the same. Methods of simulating NFAs Simulating Thompson NFAs directly → O(mn) time Converting into an equivalent DFA → It runs in O(n) for scanning, but it takes O(2m) time and space for preprocessing. Speeding-up by bit-parallel techniques:　Bit-parallel Thompson and Bit-parallel Glushkov The next theme Pattern matching on compressed texts: an introduction of Kida’s research (it’s a trend of 90's in this field!) 2011/1/7 Lecture on Information knowledge network

Appendix About the definitions of terms which I didn’t explain in the first lecture. A subset of ∑* is called a formal language or a language for short. For languages L1, L2∈∑*, the set { xy | x∈L1 and y∈L2 } is called a product of L1 and L2, and denoted by L1・L2 or L1L2 for short. For a language L⊆∑*, we define L0 = {ε}, Ln = Ln-1・L (n≧1) Moreover, we define L* = ∪n=0…∞ Ln and call it as a closure of L. We also denote L+ = ∪n=1…∞ Ln. About look-behind notations I told in the lecture that I couldn’t find the precise description of look-behind notations. But I eventually found that! Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, The MIT Press, Elsevier, 1990. (Japanese translation)コンピュータ基礎理論ハンドブックⅠ：アルゴリズムと複雑さ，丸善，1994. Chapter 5, section 2.3 and section 6.1 According to this, it seems that the notion of look-behind appeared in 1964. It exceeds the frame of context-free grammar! The matching problem of it is proved to be NP-complete. 2011/1/7 Lecture on Information knowledge network