Lecture on Information Knowledge Network "Information retrieval and pattern matching" (Hokkaido University, 2011/1/7)


1 Hokkaido University — Lecture on Information Knowledge Network, 2011/1/7
Lecture on Information Knowledge Network "Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA

2 The 5th: Regular expression matching
–About regular expressions
–Flow of processing
–Construction of the syntax tree (parse tree)
–Construction of an NFA from the RE
–Simulating the NFA

3 What is a regular expression?
A notation for flexible and powerful pattern matching.
–Example with filenames:
 > rm *.txt (matches any file whose extension is .txt)
 > cp Important[0-9].doc (matches Important0.doc, ..., Important9.doc)
–Example with the search tool grep:
 > grep -E 'for.+(256|CHAR_SIZE)' *.c
–Example in the programming language Perl:
 $line =~ m|^http://.+\.jp/.+$| (matches strings that start with http:// and include .jp/)
A regular expression expresses a regular set (regular language).
–It expresses a language L (a set of strings) that can be accepted by a finite automaton.

4 Definition of regular expressions
Definition: A regular expression is a string over Σ ∪ {ε, |, ·, *, (, )}, defined recursively by the following rules.
–(1) Any element of {ε} ∪ Σ is a regular expression.
–(2) If α and β are regular expressions, then (α·β) is a regular expression.
–(3) If α and β are regular expressions, then (α|β) is a regular expression.
–(4) If α is a regular expression, then α* is a regular expression.
–(5) Nothing other than what follows from the above is a regular expression.
Example: (A·((A·T)|(C·G))*) = A(AT|CG)*
(α·β) is often abbreviated to αβ. The symbols |, ·, * are called operators. Moreover, for a regular expression α, "+" is often used with the meaning α+ = α·α*.

5 Semantics of regular expressions
A regular expression is mapped to a subset of Σ* (a language L):
–(i) ||ε|| = {ε}
–(ii) For a ∈ Σ, ||a|| = {a}
–(iii) For regular expressions α and β, ||(α·β)|| = ||α|| · ||β||
–(iv) For regular expressions α and β, ||(α|β)|| = ||α|| ∪ ||β||
–(v) For a regular expression α, ||α*|| = ||α||*
Example: for (a·(a|b)*),
||(a·(a|b)*)|| = ||a|| · ||(a|b)*|| = {a} · (||(a|b)||)* = {a} · ({a} ∪ {b})* = { ax | x ∈ {a, b}* }
[Figure: a DFA equivalent to the example above, with states q0, q1, q2.]
Exercise: what language is equivalent to (AT|GA)(TT)*?
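One quick way to sanity-check the exercise (this is not part of the lecture) is Python's re module, whose syntax happens to match the expression as written. A string belongs to the language exactly when the whole string matches:

```python
import re

# The regular expression from the exercise: (AT|GA)(TT)*
# Its language is every string starting with "AT" or "GA",
# followed by zero or more repetitions of "TT".
pattern = re.compile(r"(AT|GA)(TT)*")

def in_language(s: str) -> bool:
    """Return True iff s belongs to L((AT|GA)(TT)*)."""
    return pattern.fullmatch(s) is not None

print(in_language("AT"))      # True
print(in_language("GATT"))    # True
print(in_language("ATT"))     # False: a lone "T" is not a full "TT" block
```

Note that fullmatch (membership in the language) is used here, not search (finding an occurrence inside a text); the latter is the matching problem discussed on the next slide.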

6 What is the regular expression matching problem?
Regular expression matching problem
–The problem of finding, in a given text, all occurrences of strings in L(α) = ||α||, the language defined by a given regular expression α.
The power of regular expressions to define languages equals that of finite automata!
–We can construct a finite automaton that accepts exactly the language expressed by a regular expression.
–Conversely, we can write a regular expression that expresses the language accepted by a finite automaton.
–Please refer to "Automaton and computability" (Section 2.5, regular expressions and regular sets), written by Setsuo Arikawa and Satoru Miyano.
To match a regular expression, we build an automaton (NFA/DFA) corresponding to it and then simulate the automaton on the text.
–A regular expression is easier to convert to an NFA than to a DFA.
–For searching, the initial state of the automaton is kept always active.
–An occurrence of the pattern expressed by the regular expression is reported whenever the automaton reaches a final state while reading the text.

7 Flow of the pattern matching process
General flow: Regular expression → (parsing) → parse tree → (NFA construction by the Thompson or Glushkov method) → NFA → (translate) → DFA → scan the text → report the occurrences.
Flow with a filtering technique: Regular expression → (extracting) → a set of strings → (multiple pattern matching) → find the candidates → (verify) → report the occurrences.

8 Construction of the parse tree
Parse tree: a tree structure used in preparation for making the NFA.
–Each leaf node is labeled by a symbol a ∈ Σ or the empty word ε.
–Each internal node is labeled by an operator symbol in {|, ·, *}.
–Although a parser tool like Lex or Flex can parse regular expressions, that would be overkill here; the pseudocode on the next slide is enough.
[Figure: the parse tree T_RE for the regular expression RE = (AT|GA)((AG|AAA)*), annotated with the depth of parentheses and the operators.]

9 Pseudo code
Parse (p = p1 p2 … pm, last)
 1 v ← θ;
 2 while p_last ≠ $ do
 3   if p_last ∈ Σ or p_last = ε then /* normal character */
 4     v_r ← create a node with p_last;
 5     if v ≠ θ then v ← [·](v, v_r);
 6     else v ← v_r;
 7     last ← last + 1;
 8   else if p_last = | then /* union operator */
 9     (v_r, last) ← Parse(p, last + 1);
10     v ← [|](v, v_r);
11   else if p_last = * then /* star operator */
12     v ← [*](v);
13     last ← last + 1;
14   else if p_last = ( then /* open parenthesis */
15     (v_r, last) ← Parse(p, last + 1);
16     last ← last + 1;
17     if v ≠ θ then v ← [·](v, v_r);
18     else v ← v_r;
19   else if p_last = ) then /* close parenthesis */
20     return (v, last);
21   end of if
22 end of while
23 return (v, last);
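The pseudocode above can be ported to Python almost line for line. This is a minimal sketch, with my own tuple labels ('sym', 'cat', 'or', 'star') for the tree nodes; as in the pseudocode, '*' applies to everything parsed so far at the current level, so the input is assumed to be fully parenthesized (as in the slides' examples) and terminated by the sentinel '$':

```python
def parse(p, i=0):
    """Parse p[i:] up to ')' or '$'; return (tree, index of the stopper)."""
    v = None                              # v = theta in the pseudocode
    while p[i] != '$':
        c = p[i]
        if c == '(':                      # parenthesized subexpression
            sub, i = parse(p, i + 1)
            i += 1                        # skip the closing ')'
            v = sub if v is None else ('cat', v, sub)
        elif c == ')':                    # let the caller consume it
            return v, i
        elif c == '|':                    # union: parse the right-hand side
            right, i = parse(p, i + 1)
            v = ('or', v, right)
        elif c == '*':                    # star what has been parsed so far
            v = ('star', v)
            i += 1
        else:                             # ordinary symbol
            node = ('sym', c)
            v = node if v is None else ('cat', v, node)
            i += 1
    return v, i

tree, _ = parse("(AT|GA)((AG|AAA)*)$")
print(tree[0])   # 'cat': the root concatenates (AT|GA) with ((AG|AAA)*)
```

The returned tuples are exactly the shape of parse tree that the Thompson and Glushkov constructions on the following slides consume.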

10 NFA construction by the Thompson method
Idea:
–Traversing the parse tree T_RE of a given RE in post-order, we construct an automaton Th(v) that accepts the language L(RE_v) corresponding to the subtree rooted at node v.
–The key point is that Th(v) can be obtained by connecting, with ε-transitions, the automata corresponding to the subtrees rooted at the children of v.
Properties of the Thompson NFA:
–The number of states is < 2m, and the number of transitions is < 4m, i.e., O(m).
–It contains many ε-transitions.
–Every non-ε transition goes from some state i to state i+1.
K. Thompson. Regular expression search algorithm. Communications of the ACM, 11(6):419-422, 1968.
[Figure: the Thompson NFA for RE = (AT|GA)((AG|AAA)*).]

11 NFA construction algorithm
Traversing the parse tree T_RE in post-order, it generates and connects an automaton for each node as follows:
(i) when v is the empty word ε: a single ε-transition from the initial state I to the final state F;
(ii) when v is a character a: a single transition labeled a from I to F;
(iii) when v is a concatenation (v_L · v_R): the final state of Th(v_L) is identified with the initial state of Th(v_R);
(iv) when v is a union (v_L | v_R): new states I and F, with ε-transitions from I to the initial states of Th(v_L) and Th(v_R), and from their final states to F;
(v) when v is a repetition v*: new states I and F, with ε-transitions from I to the initial state of Th(v) and to F, and from the final state of Th(v) back to the initial state of Th(v) and to F.

12 Behavior of the NFA construction algorithm
Example: for RE = (AT|GA)((AG|AAA)*), the algorithm traverses the parse tree T_RE in post-order and builds the Thompson NFA with states 0-17.
[Figure: the parse tree T_RE and the resulting Thompson NFA for RE = (AT|GA)((AG|AAA)*).]

13 Pseudo code
Thompson_recur (v)
 1 if v = [|](v_L, v_R) or v = [·](v_L, v_R) then
 2   Th(v_L) ← Thompson_recur(v_L);
 3   Th(v_R) ← Thompson_recur(v_R);
 4 else if v = [*](v_C) then Th(v_C) ← Thompson_recur(v_C);
 5 /* the above is the recursive (post-order) traversal */
 6 if v = (ε) then return construction (i);
 7 if v = (α), α ∈ Σ then return construction (ii);
 8 if v = [·](v_L, v_R) then return construction (iii);
 9 if v = [|](v_L, v_R) then return construction (iv);
10 if v = [*](v_C) then return construction (v);

Thompson (RE)
11 v_RE ← Parse(RE$, 1); /* construct the parse tree */
12 Th(v_RE) ← Thompson_recur(v_RE);
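Rules (ii)-(v) above can be sketched as a small Python program (rule (i) is omitted since the example contains no ε leaf). The tuple node labels ('sym', 'cat', 'or', 'star') and all identifiers are my own choices, not the slides'; the membership test simulates the NFA with ε-closures, as a direct NFA simulation would:

```python
class NFA:
    """eps[q]: epsilon-successors of q; delta[(q, a)]: the single
    a-successor of q (Thompson NFAs have at most one labelled edge per state)."""
    def __init__(self):
        self.eps, self.delta, self.n = {}, {}, 0
    def state(self):
        q = self.n
        self.n += 1
        self.eps[q] = []
        return q

def thompson(node, nfa):
    """Return (initial, final) states of the Thompson sub-NFA for node."""
    kind = node[0]
    if kind == 'sym':                       # rule (ii): one labelled edge
        i, f = nfa.state(), nfa.state()
        nfa.delta[(i, node[1])] = f
    elif kind == 'cat':                     # rule (iii): glue with epsilon
        i, lf = thompson(node[1], nfa)
        ri, f = thompson(node[2], nfa)
        nfa.eps[lf].append(ri)
    elif kind == 'or':                      # rule (iv): fork and join
        li, lf = thompson(node[1], nfa)
        ri, rf = thompson(node[2], nfa)
        i, f = nfa.state(), nfa.state()
        nfa.eps[i] += [li, ri]
        nfa.eps[lf].append(f)
        nfa.eps[rf].append(f)
    else:                                   # rule (v): star loop
        ci, cf = thompson(node[1], nfa)
        i, f = nfa.state(), nfa.state()
        nfa.eps[i] += [ci, f]
        nfa.eps[cf] += [ci, f]
    return i, f

def closure(nfa, states):
    """Epsilon-closure of a set of states."""
    stack, seen = list(states), set(states)
    while stack:
        for q in nfa.eps[stack.pop()]:
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def accepts(nfa, init, final, text):
    cur = closure(nfa, {init})
    for a in text:
        cur = closure(nfa, {nfa.delta[(q, a)] for q in cur if (q, a) in nfa.delta})
    return final in cur

# Hand-written parse tree of the running example (AT|GA)((AG|AAA)*).
sym = lambda a: ('sym', a)
cat = lambda l, r: ('cat', l, r)
tree = cat(('or', cat(sym('A'), sym('T')), cat(sym('G'), sym('A'))),
           ('star', ('or', cat(sym('A'), sym('G')),
                           cat(cat(sym('A'), sym('A')), sym('A')))))
nfa = NFA()
i0, f0 = thompson(tree, nfa)
print(accepts(nfa, i0, f0, "ATAG"))         # True
```

This is a membership test on whole strings; for searching inside a text, the initial state would be kept always active, as on slide 6.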

14 NFA construction by the Glushkov method
Idea:
–Make a new expression RE' by numbering each symbol sequentially from the beginning to the end (let Σ' be the alphabet with subscripts).
 Example: RE = (AT|GA)((AG|AAA)*) → RE' = (A1T2|G3A4)((A5G6|A7A8A9)*)
–After constructing an NFA that accepts the language L(RE'), we obtain the final NFA by removing the subscript numbers.
Properties of the Glushkov NFA:
–The number of states is exactly m+1, and the number of transitions is O(m^2).
–It contains no ε-transitions.
–For any state, all transitions entering that state carry the same label.
V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53, 1961.
[Figure: the NFA for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*) and the Glushkov NFA obtained by removing the subscripts.]

15 NFA construction algorithm (1)
Construction procedure:
–Make a new expression RE' by numbering each symbol sequentially from the beginning to the end. Pos(RE) = {1, …, m}, and Σ' is the alphabet with subscript numbers.
–Traversing the parse tree T_RE in post-order, for each language RE_v corresponding to the subtree rooted at v, compute the set First(RE_v), the set Last(RE_v), the function Empty_v, and the function Follow(RE', x) of each position x, defined as follows:
 First(RE') = {x ∈ Pos(RE) | ∃u ∈ Σ'*, α_x u ∈ L(RE')}  (the transitions from the initial state)
 Last(RE') = {x ∈ Pos(RE) | ∃u ∈ Σ'*, u α_x ∈ L(RE')}  (the positions of the final states)
 Follow(RE', x) = {y ∈ Pos(RE) | ∃u, v ∈ Σ'*, u α_x α_y v ∈ L(RE')}  (the transition function)
 Empty_RE: a function that returns {ε} if ε belongs to L(RE) and φ otherwise (i.e., whether the initial state of the NFA is also a final state). It can be computed recursively:
 Empty_ε = {ε}, Empty_α = φ, Empty_{RE1|RE2} = Empty_RE1 ∪ Empty_RE2, Empty_{RE1·RE2} = Empty_RE1 ∩ Empty_RE2, Empty_{RE*} = {ε}.
–The NFA is constructed from the values obtained above.

16 NFA construction algorithm (2)
The Glushkov NFA GL = (S, Σ', I, F, δ') that accepts the language L(RE'):
–S: the set of states, S = {0, 1, …, m}
–Σ': the alphabet with subscript numbers
–I: the initial state, I = 0
–F: the final states, F = Last(RE') ∪ (Empty_RE · {0})
–δ': the transition function, defined by:
 ∀x ∈ Pos(RE'), ∀y ∈ Follow(RE', x): δ'(x, α_y) = y
 The transitions from the initial state are: ∀y ∈ First(RE'): δ'(0, α_y) = y
[Figure: the NFA for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*).]

17 Pseudo code
Glushkov_variables (v_RE, lpos)
 1 if v = [|](v_l, v_r) or v = [·](v_l, v_r) then
 2   lpos ← Glushkov_variables(v_l, lpos);
 3   lpos ← Glushkov_variables(v_r, lpos);
 4 else if v = [*](v_*) then lpos ← Glushkov_variables(v_*, lpos);
 5 end of if
 6 if v = (ε) then
 7   First(v) ← φ, Last(v) ← φ, Empty_v ← {ε};
 8 else if v = (a), a ∈ Σ then
 9   lpos ← lpos + 1;
10   First(v) ← {lpos}, Last(v) ← {lpos}, Empty_v ← φ, Follow(lpos) ← φ;
11 else if v = [|](v_l, v_r) then
12   First(v) ← First(v_l) ∪ First(v_r);
13   Last(v) ← Last(v_l) ∪ Last(v_r);
14   Empty_v ← Empty_vl ∪ Empty_vr;
15 else if v = [·](v_l, v_r) then
16   First(v) ← First(v_l) ∪ (Empty_vl · First(v_r));
17   Last(v) ← (Empty_vr · Last(v_l)) ∪ Last(v_r);
18   Empty_v ← Empty_vl ∩ Empty_vr;
19   for x ∈ Last(v_l) do Follow(x) ← Follow(x) ∪ First(v_r);
20 else if v = [*](v_*) then
21   First(v) ← First(v_*), Last(v) ← Last(v_*), Empty_v ← {ε};
22   for x ∈ Last(v_*) do Follow(x) ← Follow(x) ∪ First(v_*);
23 end of if
24 return lpos;
(The Follow updates take O(m^2) time; the whole computation takes O(m^3) time in total.)
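The same computation can be sketched in Python over hand-written tuple parse-tree nodes ('sym', 'cat', 'or', 'star' are my own labels). Empty is represented as a boolean, positions are assigned to leaves in left-to-right order, and the Follow sets accumulate in a dictionary:

```python
def glushkov_vars(node, pos, follow):
    """Return (First, Last, Empty) of node; Follow sets accumulate in follow."""
    kind = node[0]
    if kind == 'sym':                     # new position for each leaf
        pos[0] += 1
        x = pos[0]
        follow[x] = set()
        return {x}, {x}, False
    if kind == 'or':                      # union: componentwise set unions
        f1, l1, e1 = glushkov_vars(node[1], pos, follow)
        f2, l2, e2 = glushkov_vars(node[2], pos, follow)
        return f1 | f2, l1 | l2, e1 or e2
    if kind == 'cat':                     # concatenation
        f1, l1, e1 = glushkov_vars(node[1], pos, follow)
        f2, l2, e2 = glushkov_vars(node[2], pos, follow)
        for x in l1:                      # Last of left feeds First of right
            follow[x] |= f2
        return (f1 | f2) if e1 else f1, (l1 | l2) if e2 else l2, e1 and e2
    # star: loop Last back to First; the empty word is accepted
    f1, l1, _ = glushkov_vars(node[1], pos, follow)
    for x in l1:
        follow[x] |= f1
    return f1, l1, True

# Parse tree of (AT|GA)((AG|AAA)*); positions 1..9 left to right.
sym = lambda a: ('sym', a)
cat = lambda l, r: ('cat', l, r)
tree = cat(('or', cat(sym('A'), sym('T')), cat(sym('G'), sym('A'))),
           ('star', ('or', cat(sym('A'), sym('G')),
                           cat(cat(sym('A'), sym('A')), sym('A')))))
follow = {}
first, last, empty = glushkov_vars(tree, [0], follow)
print(first, last)    # {1, 3} {2, 4, 6, 9}
```

The results match the NFA on slide 16: First gives the transitions from state 0, Last (plus state 0 when Empty holds) gives the final states, and Follow gives the remaining transitions.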

18 Pseudo code (cont.)
Glushkov (RE)
 1 /* make the parse tree by parsing the regular expression */
 2 v_RE ← Parse(RE$, 1);
 3
 4 /* calculate each variable by using the parse tree */
 5 m ← Glushkov_variables(v_RE, 0);
 6
 7 /* construct NFA GL = (S, Σ', I, F, Δ) from the variables */
 8 Δ ← φ;
 9 for i ← 0…m do create state i;
10 for x ∈ First(v_RE) do Δ ← Δ ∪ {(0, α_x, x)};
11 for i ← 0…m do
12   for x ∈ Follow(i) do Δ ← Δ ∪ {(i, α_x, x)};
13 end of for
14 for x ∈ Last(v_RE) ∪ (Empty_vRE · {0}) do mark x as terminal;

19 Take a breath
[Photo: Taiwan High Speed Rail]

20 Flow of the pattern matching process
Regular expression → (parsing) → parse tree → (NFA construction by the Thompson or Glushkov method) → NFA → (translate) → DFA → scan the text → report the occurrences.
–An NFA can be simulated in O(mn) time.
–Translating an NFA into a DFA needs O(2^m) time and space.
–There also exists a method for converting a regular expression directly into a DFA; please refer to Section 3.9 of Compilers: Principles, Techniques, and Tools, written by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986.

21 Methods of simulating NFAs
Simulating a Thompson NFA directly:
–The most naive method.
–Storing the currently active states in a list of size O(m), it updates the states of the NFA in O(m) time for each symbol read from the text.
–It therefore takes O(mn) time.
Simulating a Thompson NFA by converting it into an equivalent DFA:
–A classical technique; refer to Compilers: Principles, Techniques, and Tools, written by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986.
–The conversion is done as preprocessing; it takes O(2^m) time and space.
–There are also techniques that convert states dynamically (lazily) while scanning the text.
Hybrid method:
–E. W. Myers. A four Russians algorithm for regular expression pattern matching. Journal of the ACM, 39(2), 1992.
–A method that combines NFA and DFA to do efficient matching.
–It divides the Thompson NFA into modules of O(k) nodes each, converts each module into a DFA, and simulates the transitions between the modules as an NFA.
High-speed NFA simulation by bit-parallel techniques:
–Simulating the Thompson NFA: proposed by S. Wu and U. Manber [1992].
–Simulating the Glushkov NFA: proposed by G. Navarro and M. Raffinot [1999].

22 Simulating by converting into an equivalent DFA
Classical (N = (Q, Σ, I, F, Δ), T = t1 t2 … tn)
 1 Preprocessing:
 2   for σ ∈ Σ do Δ ← Δ ∪ {(I, σ, I)}; /* self-loop on the initial state */
 3   (Q_d, Σ, I_d, F_d, δ) ← BuildDFA(N); /* make a DFA equivalent to the NFA N */
 4 Searching:
 5   s ← I_d;
 6   for pos ← 1…n do
 7     if s ∈ F_d then report an occurrence ending at pos − 1;
 8     s ← δ(s, t_pos);
 9   end of for
[Figure: a DFA converted from the Glushkov NFA for RE = (AT|GA)((AG|AAA)*).]
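As a runnable sketch of this search, here is the Glushkov NFA for (AT|GA)((AG|AAA)*) (states 0-9 as on slide 16, with the initial self-loop added) converted lazily into a DFA whose states are frozensets of NFA states. Identifiers are my own; for simplicity it reports the end position of each occurrence directly, rather than pos − 1 on the following step as in the pseudocode:

```python
# Glushkov transitions for (AT|GA)((AG|AAA)*): (state, symbol) -> successors.
DELTA = {(0, 'A'): {1}, (0, 'G'): {3}, (1, 'T'): {2}, (3, 'A'): {4},
         (5, 'G'): {6}, (7, 'A'): {8}, (8, 'A'): {9}}
for q in (2, 4, 6, 9):                 # loop back into (AG|AAA)*
    DELTA[(q, 'A')] = {5, 7}
FINAL = {2, 4, 6, 9}

def search(text):
    """Report end positions of occurrences, building the DFA lazily."""
    def step(states, a):
        nxt = {0}                      # self-loop: state 0 is always active
        for q in states:
            nxt |= DELTA.get((q, a), set())
        return frozenset(nxt)
    dfa = {}                           # (state set, symbol) -> state set
    s = frozenset({0})
    occ = []
    for pos, a in enumerate(text, start=1):
        if (s, a) not in dfa:          # lazy subset construction
            dfa[(s, a)] = step(s, a)
        s = dfa[(s, a)]
        if s & FINAL:
            occ.append(pos)
    return occ

print(search("CCATAGC"))               # [4, 6]: "AT" ends at 4, "ATAG" at 6
```

Building DFA states lazily, as scanning proceeds, is the dynamic-conversion variant mentioned on the previous slide: only the subsets actually reached by the text are ever materialized.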

23 Bit-parallel Thompson (BPThompson)
Simulating the Thompson NFA with the bit-parallel technique:
–In a Thompson NFA, every non-ε transition goes from state i to state i+1, so bit-parallelism similar to the Shift-And method is applicable.
–ε-transitions are simulated separately. This needs a mask table of size 2^L (L is the number of states of the NFA).
–It takes O(2^L + m|Σ|) time for preprocessing.
–It scans the text in O(n) time when L is small enough.
For the NFA N = (Q = {s_0, …, s_{|Q|−1}}, Σ, I = s_0, F, Δ):
–Bit-mask representation of the NFA: Q_n = {0, …, |Q|−1}, I_n = 0^{|Q|−1} 1, F_n = OR_{s_j ∈ F} 0^{|Q|−1−j} 1 0^j
–Definitions of the mask tables:
 B_n[i, σ] = OR_{(s_i, σ, s_j) ∈ Δ} 0^{|Q|−1−j} 1 0^j
 E_n[i] = OR_{s_j ∈ E(i)} 0^{|Q|−1−j} 1 0^j  (where E(i) is the ε-closure of state s_i)
 E_d[D] = OR_{i : i = 0 or D & 0^{L−i−1} 1 0^i ≠ 0^L} E_n[i]
 B[σ] = OR_{i ∈ 0…m} B_n[i, σ]
S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.

24 Pseudo code
BuildEps (N = (Q_n, Σ, I_n, F_n, B_n, E_n))
 1 for σ ∈ Σ do
 2   B[σ] ← 0^L;
 3   for i ← 0…L−1 do B[σ] ← B[σ] | B_n[i, σ];
 4 end of for
 5 E_d[0] ← E_n[0];
 6 for i ← 0…L−1 do
 7   for j ← 0…2^i − 1 do
 8     E_d[2^i + j] ← E_n[i] | E_d[j];
 9   end of for
10 end of for
11 return (B, E_d);

BPThompson (N = (Q_n, Σ, I_n, F_n, B_n, E_n), T = t1 t2 … tn)
 1 Preprocessing:
 2   (B, E_d) ← BuildEps(N);
 3 Searching:
 4   D ← E_d[I_n]; /* initial state */
 5   for pos ← 1…n do
 6     if D & F_n ≠ 0^L then report an occurrence ending at pos − 1;
 7     D ← E_d[(D << 1) & B[t_pos]];
 8   end of for

25 Bit-parallel Glushkov (BPGlushkov)
Simulating the Glushkov NFA with the bit-parallel technique:
–In a Glushkov NFA, for any state, all transitions entering that state carry the same label. Although bit-parallelism of the Shift-And type is not directly applicable, each state transition can be computed as T_d[D] & B[σ].
–The number of mask-table entries is 2^{|Q|} (versus 2^L for BPThompson).
–It takes O(2^m + m|Σ|) time for preprocessing.
–It scans the text in O(n) time when m is small enough.
–It is more efficient than BPThompson in almost all cases.
For the NFA GL = (Q = {s_0, …, s_{|Q|−1}}, Σ, I = s_0, F, Δ):
–Bit-mask representation of the NFA: Q_n = {0, …, |Q|−1}, I_n = 0^{|Q|−1} 1, F_n = OR_{s_j ∈ F} 0^{|Q|−1−j} 1 0^j
–Definitions of the mask tables:
 B_n[i, σ] = OR_{(s_i, σ, s_j) ∈ Δ} 0^{|Q|−1−j} 1 0^j
 B[σ] = OR_{i ∈ 0…m} B_n[i, σ]
 T_d[D] = OR_{(i, σ) : D & 0^{m−i} 1 0^i ≠ 0^{m+1}, σ ∈ Σ} B_n[i, σ]
G. Navarro and M. Raffinot. Fast regular expression search. In Proc. of WAE99, LNCS 1668, 1999.

26 Pseudo code
BuildTran (N = (Q_n, Σ, I_n, F_n, B_n))
 1 for i ← 0…m do A[i] ← 0^{m+1};
 2 for σ ∈ Σ do B[σ] ← 0^{m+1};
 3 for i ← 0…m, σ ∈ Σ do
 4   A[i] ← A[i] | B_n[i, σ];
 5   B[σ] ← B[σ] | B_n[i, σ];
 6 end of for
 7 T_d[0] ← 0^{m+1};
 8 for i ← 0…m do
 9   for j ← 0…2^i − 1 do
10     T_d[2^i + j] ← A[i] | T_d[j];
11   end of for
12 end of for
13 return (B, T_d);

BPGlushkov (N = (Q_n, Σ, I_n, F_n, B_n), T = t1 t2 … tn)
 1 Preprocessing:
 2   for σ ∈ Σ do B_n[0, σ] ← B_n[0, σ] | 0^m 1; /* self-loop on the initial state */
 3   (B, T_d) ← BuildTran(N);
 4 Searching:
 5   D ← 0^m 1; /* initial state */
 6   for pos ← 1…n do
 7     if D & F_n ≠ 0^{m+1} then report an occurrence ending at pos − 1;
 8     D ← T_d[D] & B[t_pos];
 9   end of for
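BPGlushkov can be sketched concretely for the running example (AT|GA)((AG|AAA)*), with the Glushkov states 0-9 of slide 16 encoded as bits of a Python integer. The masks are written out by hand here (identifiers are my own); T_d is built with exactly the doubling trick of BuildTran, and occurrences are reported at their end position rather than at pos − 1 on the following step:

```python
m = 9                                   # positions 1..9, plus state 0
# SUCC[i]: bitmask of all states reachable from state i (the A[i] masks).
SUCC = {0: 0b1 | (1 << 1) | (1 << 3),   # self-loop, A -> 1, G -> 3
        1: 1 << 2, 3: 1 << 4, 5: 1 << 6, 7: 1 << 8, 8: 1 << 9}
for q in (2, 4, 6, 9):                  # loop back into (AG|AAA)*
    SUCC[q] = (1 << 5) | (1 << 7)
# B[s]: bit j set iff some transition labelled s enters state j.
# Bit 0 is set in every mask so the initial self-loop survives the AND.
B = {'A': 1 | (1 << 1) | (1 << 4) | (1 << 5) | (1 << 7) | (1 << 8) | (1 << 9),
     'T': 1 | (1 << 2),
     'G': 1 | (1 << 3) | (1 << 6)}
F = (1 << 2) | (1 << 4) | (1 << 6) | (1 << 9)   # final states 2, 4, 6, 9

# T_d by the doubling trick of BuildTran: T[2^i + j] = SUCC[i] | T[j].
T = [0] * (1 << (m + 1))
for i in range(m + 1):
    for j in range(1 << i):
        T[(1 << i) + j] = SUCC.get(i, 0) | T[j]

def bp_search(text):
    """D' = T_d[D] & B[t]: one table lookup and one AND per text symbol."""
    D = 1                               # only the initial state active
    occ = []
    for pos, a in enumerate(text, start=1):
        D = T[D] & B.get(a, 1)          # unknown symbols keep only state 0
        if D & F:
            occ.append(pos)
    return occ

print(bp_search("CCATAGC"))             # [4, 6], same as the DFA simulation
```

The whole active-state set moves in a single lookup and a single AND per symbol, which is the point of the method; for this tiny NFA, T has 2^10 entries, illustrating why the O(2^m) table size limits m in practice.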

27 Other topics
Extended regular expressions:
–Regular expressions allowing two more operations, intersection and complementation, in addition to concatenation, union, and repetition.
 Example: UNIX ∈ L(UNI(.)* ∧ (.)*NIX)
–They are different from POSIX regular expressions.
–H. Yamamoto. An Automata-based Recognition Algorithm for Semi-extended Regular Expressions. Proc. MFCS 2000, LNCS 1893, 2000.
–O. Kupferman and S. Zuhovitzky. An Improved Algorithm for the Membership Problem for Extended Regular Expressions. Proc. MFCS 2002, LNCS 2420, 2002.
Research on speeding up regular expression matching:
–Filtration technique using BNDM + verification:
 G. Navarro and M. Raffinot. New Techniques for Regular Expression Searching. Algorithmica, 41(2).
 In this paper, a method of simulating the Glushkov NFA with mask tables of O(m·2^m) bits is also presented.

28 The 5th summary
Regular expressions:
–Their power to define languages is the same as that of finite automata.
Flow of regular expression matching:
–Translate the expression into a parse tree, construct the corresponding NFA, and perform matching by simulating the NFA.
–Alternatively: filtration + multiple pattern matching + verification + NFA simulation.
Methods for constructing an NFA:
–Thompson NFA: the number of states is < 2m and the number of transitions is < 4m, i.e., O(m). It contains many ε-transitions. Every non-ε transition goes from some state i to state i+1.
–Glushkov NFA: the number of states is exactly m+1, and the number of transitions is O(m^2). It contains no ε-transitions. For any state, all transitions entering that state carry the same label.
Methods of simulating NFAs:
–Simulating a Thompson NFA directly: O(mn) time.
–Converting into an equivalent DFA: scanning runs in O(n) time, but preprocessing takes O(2^m) time and space.
–Speeding up with bit-parallel techniques: Bit-parallel Thompson and Bit-parallel Glushkov.
The next theme:
–Pattern matching on compressed texts: an introduction to Kida's research (a trend of the 90's in this field!).

29 Appendix
On the definitions of terms that I didn't explain in the first lecture:
–A subset of Σ* is called a formal language, or a language for short.
–For languages L1, L2 ⊆ Σ*, the set {xy | x ∈ L1 and y ∈ L2} is called the product of L1 and L2, and is denoted by L1·L2, or L1L2 for short.
–For a language L ⊆ Σ*, we define L^0 = {ε} and L^n = L^{n−1}·L (n ≥ 1). Moreover, we define L* = ∪_{n=0…∞} L^n and call it the closure of L. We also write L^+ = ∪_{n=1…∞} L^n.
About look-behind notations:
–I said in the lecture that I couldn't find a precise description of look-behind notations, but I eventually found one!
 Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, The MIT Press / Elsevier (Japanese translation), Chapter 5, Sections 2.3 and 6.1.
–According to this, it seems that the notion of look-behind appeared in …
–It exceeds the power of context-free grammars!
–Its matching problem is proved to be NP-complete.

