# Regular Expressions, Finite State Automaton


Regular Expressions Finite State Automaton

Programming Languages: Regular expressions

Terminology on formal languages:

- alphabet: a finite set of symbols
- string: a finite sequence of alphabet symbols
- language: a (finite or infinite) set of strings

Regular operations on languages:

- Union: R ∪ S = { x | x ∈ R or x ∈ S }
- Concatenation: RS = { xy | x ∈ R and y ∈ S }
- Kleene closure: R* = R concatenated with itself 0 or more times = {ε} ∪ R ∪ RR ∪ RRR ∪ … = strings obtained by concatenating a finite number of strings from the set R.
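The three regular operations can be tried out on small finite languages. The following is a minimal sketch, assuming languages are represented as Python sets of strings; the function names `union`, `concat`, and `kleene` are illustrative, and because R* is usually infinite, `kleene` only enumerates strings up to an assumed length bound.

```python
from itertools import product

EPSILON = ""  # the empty string


def union(r, s):
    """R ∪ S: strings that are in R or in S."""
    return r | s


def concat(r, s):
    """RS: every string of R followed by every string of S."""
    return {x + y for x, y in product(r, s)}


def kleene(r, max_len):
    """Strings of R* with at most max_len characters (R* itself may be infinite)."""
    result = {EPSILON}
    frontier = {EPSILON}
    while frontier:
        # Append one more string from R, keep only new strings within the bound.
        frontier = {w for w in concat(frontier, r) if len(w) <= max_len} - result
        result |= frontier
    return result


R = {"a", "bc"}
S = {"d"}
print(sorted(union(R, S)))
print(sorted(concat(R, S)))
print(sorted(kleene(R, 3), key=lambda w: (len(w), w)))
```

The length bound is only there to keep the enumeration finite; it is not part of the definition of Kleene closure.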

Regular Expressions

A pattern notation for describing certain kinds of sets of strings. Given an alphabet Σ:

- ε is a regular expression (denotes the language {ε})
- for each a ∈ Σ, a is a regular expression (denotes the language {a})
- if r and s are regular expressions denoting L(r) and L(s) respectively, then so are:
  - (r) | (s) (denotes the language L(r) ∪ L(s))
  - (r)(s) (denotes the language L(r)L(s))
  - (r)* (denotes the language L(r)*)
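The inductive definition translates almost directly into a recursive membership test. The sketch below assumes a hypothetical tuple encoding of regular expressions (`("eps",)`, `("sym", a)`, `("alt", r, s)`, `("cat", r, s)`, `("star", r)`) and a function name `matches` chosen for illustration. It tries every way of splitting the input, so it is meant to mirror the definition, not to be efficient.

```python
def matches(r, w):
    """Return True if the string w is in the language L(r)."""
    kind = r[0]
    if kind == "eps":                     # L(ε) = {ε}
        return w == ""
    if kind == "sym":                     # L(a) = {a}
        return w == r[1]
    if kind == "alt":                     # L(r|s) = L(r) ∪ L(s)
        return matches(r[1], w) or matches(r[2], w)
    if kind == "cat":                     # L(rs) = L(r)L(s): try every split point
        return any(matches(r[1], w[:i]) and matches(r[2], w[i:])
                   for i in range(len(w) + 1))
    if kind == "star":                    # L(r*): ε, or a non-empty piece from L(r) then more of L(r*)
        return w == "" or any(matches(r[1], w[:i]) and matches(r, w[i:])
                              for i in range(1, len(w) + 1))
    raise ValueError(f"unknown node kind: {kind!r}")


# (a|b)*c
expr = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
for word in ["c", "abbc", "abca", ""]:
    print(repr(word), matches(expr, word))
```

Requiring a non-empty prefix in the `star` case guarantees that each recursive call works on a strictly shorter string, so the test always terminates.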

Common Extensions to r.e. Notation

- One or more repetitions of r: r+
- A range of characters: [a-zA-Z], [0-9]
- An optional expression: r?
- Any single character: .
- Giving names to regular expressions, e.g.:
  - letter = [a-zA-Z_]
  - digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  - ident = letter ( letter | digit )*
  - Integer_const = digit+
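These extensions add convenience, not expressive power: each one can be rewritten with the basic operators. The sketch below checks a few such rewrites with Python's `re` module by comparing two patterns on every short string over a small alphabet. The helper `same_language` and its length bound are assumptions made for illustration; agreement on short strings is a sanity check, not a proof.

```python
import re
from itertools import product


def same_language(p1, p2, alphabet, max_len):
    """Compare two patterns on every string over `alphabet` with at most max_len characters."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            w = "".join(chars)
            if bool(re.fullmatch(p1, w)) != bool(re.fullmatch(p2, w)):
                return False
    return True


# r+ is shorthand for r r*
print(same_language(r"(ab)+", r"(ab)(ab)*", "ab", 6))
# r? is shorthand for (r | ε); the empty alternative plays the role of ε
print(same_language(r"a(b)?", r"a(b|)", "ab", 4))
# a character range is shorthand for an alternation
print(same_language(r"[0-3]", r"0|1|2|3", "01234", 2))

# Named regular expressions, built up by substitution
letter = r"[a-zA-Z_]"
digit = r"[0-9]"
ident = rf"{letter}({letter}|{digit})*"
integer_const = rf"{digit}+"
print(bool(re.fullmatch(ident, "x_1")), bool(re.fullmatch(ident, "1x")))
print(bool(re.fullmatch(integer_const, "2024")))
```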

Examples of Regular Expressions

Identifiers:

- Letter → (a|b|c| … |z|A|B|C| … |Z)
- Digit → (0|1|2| … |9)
- Identifier → Letter ( Letter | Digit )*

Numbers:

- [0-9]
- [0-9]+
- [0-9]*
- [1-9][0-9]*
- ([1-9][0-9]*)|0
- -?[0-9]+
- [0-9]*\.[0-9]+
- ([0-9]+)|([0-9]*\.[0-9]+)
- [eE][-+]?[0-9]+
- ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?
- -?( ([0-9]+) | ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? )
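A quick way to see what each number pattern accepts is to run a handful of sample strings through Python's `re.fullmatch`. This is a small sketch; the sample strings and the dictionary labels are chosen for illustration, and the last pattern is written without the spaces used on the slide, since Python treats spaces in a pattern as literal characters.

```python
import re

patterns = {
    "no leading zeros": r"([1-9][0-9]*)|0",
    "signed integer": r"-?[0-9]+",
    "decimal": r"[0-9]*\.[0-9]+",
    "number": r"-?(([0-9]+)|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?)",
}
samples = ["0", "007", "-42", ".5", "3.14e-2", "12e5", "1."]

for name, pattern in patterns.items():
    accepted = [s for s in samples if re.fullmatch(pattern, s)]
    print(f"{name}: {accepted}")
```

Because alternation has the lowest precedence, the final pattern attaches the optional exponent only to the decimal alternative, so `12e5` is rejected while `3.14e-2` is accepted.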

Examples of Regular Expressions

Numbers:

- Integer → (+|-|ε) (0 | (1|2|3| … |9)(Digit*))
- Decimal → Integer . Digit*
- Real → ( Integer | Decimal ) E (+|-|ε) Digit*
- Complex → ( Real , Real )

Exercises on Regular Expressions

- Character strings: [a-z][a-zA-Z][a-zA-Z0-9] [a-zA-Z][a-zA-Z0-9]*
- String literals, e.g. "this is a string":
  - \".*\" <- wrong!!! why?
  - \"[^"]*\"
- Some exercises. Among strings made up of 0s and 1s:
  - strings that start with 0: 0[01]*
  - strings that start with 0 and end with 0: 0[01]*0
  - strings in which 0 and 1 alternate: ______
  - strings in which 0 never appears twice in a row: ______
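One way to explore the "why?" in the string-literal exercise is to run both patterns over an input that contains two string literals. The sketch below uses Python's `re.findall`; the sample input line is made up for illustration.

```python
import re

source = 'x = "a" + "b"'

# `.` also matches the quote character, and `*` is greedy,
# so the match runs from the first quote to the last one.
greedy = re.findall(r'".*"', source)

# Excluding the quote from the repeated part stops at the nearest closing quote.
tight = re.findall(r'"[^"]*"', source)

print(greedy)
print(tight)

# The two solved 0/1 exercises, checked on a few strings
starts_with_0 = r"0[01]*"
starts_and_ends_with_0 = r"0[01]*0"
for w in ["0", "010", "011", "10"]:
    print(w,
          bool(re.fullmatch(starts_with_0, w)),
          bool(re.fullmatch(starts_and_ends_with_0, w)))
```

Note that 0[01]*0 requires at least two characters, so the one-character string 0 is not matched by it even though it starts and ends with 0.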

Recognizing Tokens: Finite Automata

A finite automaton is a 5-tuple (Q, Σ, T, q0, F), where:

- Σ is a finite alphabet;
- Q is a finite set of states;
- T: Q × Σ → Q is the transition function;
- q0 ∈ Q is the initial state; and
- F ⊆ Q is a set of final states.
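The 5-tuple maps naturally onto a small data structure. The following is a minimal sketch, assuming a Python dataclass named `DFA` with the transition function stored as a dictionary keyed by (state, symbol); the field names and the example automaton (strings over {0, 1} that start with 0) are illustrative choices, not part of the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    states: frozenset      # Q
    alphabet: frozenset    # Σ
    transition: dict       # T, keyed by (state, symbol)
    start: str             # q0
    finals: frozenset      # F

    def __post_init__(self):
        # Check the side conditions of the definition.
        if self.start not in self.states:
            raise ValueError("q0 must be a member of Q")
        if not self.finals <= self.states:
            raise ValueError("F must be a subset of Q")
        for q in self.states:
            for a in self.alphabet:
                if self.transition.get((q, a)) not in self.states:
                    raise ValueError(f"T is not defined on ({q!r}, {a!r})")

    def accepts(self, word):
        state = self.start
        for symbol in word:
            if symbol not in self.alphabet:
                return False   # symbols outside Σ are rejected outright
            state = self.transition[(state, symbol)]
        return state in self.finals


# Hypothetical example: strings over {0, 1} that start with 0
starts_with_zero = DFA(
    states=frozenset({"start", "ok", "dead"}),
    alphabet=frozenset("01"),
    transition={
        ("start", "0"): "ok", ("start", "1"): "dead",
        ("ok", "0"): "ok", ("ok", "1"): "ok",
        ("dead", "0"): "dead", ("dead", "1"): "dead",
    },
    start="start",
    finals=frozenset({"ok"}),
)
print(starts_with_zero.accepts("0110"), starts_with_zero.accepts("10"))
```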

Finite Automata: An Example

A (deterministic) finite automaton (DFA) to match C-style comments:

[Figure: state diagram of the comment-matching DFA]

Example 2

Consider the problem of recognizing register names:

Register → r (0|1|2| … |9) (0|1|2| … |9)*

- Allows registers of arbitrary number
- Requires at least one digit

The RE corresponds to a recognizer (or DFA). Transitions on other inputs go to an error state, se.

[Figure: Recognizer for Register. s0 goes to s1 on r; s1 goes to the accepting state s2 on (0|1|2| … |9); s2 loops on (0|1|2| … |9).]

Example 2 (continued)

DFA operation:

- Start in state s0 and take transitions on each input character
- The DFA accepts a word x iff x leaves it in a final state (s2)

So,

- r17 takes it through s0, s1, s2 and accepts
- r takes it through s0, s1 and fails
- a takes it straight to se

Example 2 (continued)

To be useful, the recognizer must turn into code.

Skeleton recognizer:

    Char ← next character
    State ← s0
    while (Char ≠ EOF)
        State ← δ(State, Char)
        Char ← next character
    if (State is a final state)
        then report success
        else report failure

Table encoding the RE:

| δ | r | 0,1,2,3,4,5,6,7,8,9 | All others |
| --- | --- | --- | --- |
| s0 | s1 | se | se |
| s1 | se | s2 | se |
| s2 | se | s2 | se |
| se | se | se | se |
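The skeleton recognizer and its transition table can be written out as a short runnable program. This is a sketch under a few assumptions: the names `column`, `DELTA`, `FINAL`, and `recognize` are illustrative, the input is a Python string, and `None` stands in for EOF.

```python
DIGITS = set("0123456789")


def column(ch):
    """Map an input character to its column in the transition table."""
    if ch == "r":
        return "r"
    if ch in DIGITS:
        return "digit"
    return "other"


# One row per state, one entry per table column.
DELTA = {
    "s0": {"r": "s1", "digit": "se", "other": "se"},
    "s1": {"r": "se", "digit": "s2", "other": "se"},
    "s2": {"r": "se", "digit": "s2", "other": "se"},
    "se": {"r": "se", "digit": "se", "other": "se"},
}
FINAL = {"s2"}


def recognize(text):
    chars = iter(text)
    char = next(chars, None)          # Char ← next character (None plays the role of EOF)
    state = "s0"                      # State ← s0
    while char is not None:           # while (Char ≠ EOF)
        state = DELTA[state][column(char)]
        char = next(chars, None)
    return state in FINAL             # success iff State is a final state


for word in ["r17", "r", "a", "r00000"]:
    print(word, recognize(word))
```

Grouping characters into table columns keeps the table small: the cost per input character is one classification plus one lookup, regardless of how many characters the alphabet has.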

What if we need a tighter specification?

r Digit Digit* allows arbitrary numbers:

- Accepts r00000
- Accepts r99999
- What if we want to limit it to r0 through r31?

Write a tighter regular expression:

- Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
- Register → r0|r1|r2| … |r31|r00|r01|r02| … |r09

This produces a more complex DFA:

- Has more states
- Same cost per transition
- Same basic implementation
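The first tighter expression can be checked against the explicit list of names by brute force. The sketch below transcribes it into Python's `re` syntax (with `([0-9]|)` standing for `(Digit | ε)`) and tests every name made of `r` followed by one to three digits. It assumes the `…` in the second form stands for the full ranges r0 through r31 and r00 through r09.

```python
import re
from itertools import product

# Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
TIGHT = re.compile(r"r((0|1|2)([0-9]|)|(4|5|6|7|8|9)|(3|30|31))")

# Assumed reading of the enumerated form: r0..r31 plus r00..r09
expected = {f"r{n}" for n in range(32)} | {f"r0{n}" for n in range(10)}

# Every name made of r followed by one, two, or three digits
candidates = {
    "r" + "".join(digits)
    for length in (1, 2, 3)
    for digits in product("0123456789", repeat=length)
}
accepted = {name for name in candidates if TIGHT.fullmatch(name)}

print(len(accepted), accepted == expected)
```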

Tighter register specification (continued)

The DFA for Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )

- Accepts a more constrained set of registers
- Same set of actions, more states

[Figure: DFA with states s0 through s6. s0 goes to s1 on r; s1 goes to s2 on 0,1,2, to s5 on 3, and to s4 on 4,5,6,7,8,9; s2 goes to s3 on (0|1|2| … |9); s5 goes to s6 on 0,1.]

Tighter register specification (continued)

Table encoding the RE for the tighter register specification:

| δ | r | 0,1 | 2 | 3 | 4-9 | All others |
| --- | --- | --- | --- | --- | --- | --- |
| s0 | s1 | se | se | se | se | se |
| s1 | se | s2 | s2 | s5 | s4 | se |
| s2 | se | s3 | s3 | s3 | s3 | se |
| s3 | se | se | se | se | se | se |
| s4 | se | se | se | se | se | se |
| s5 | se | s6 | se | se | se | se |
| s6 | se | se | se | se | se | se |
| se | se | se | se | se | se | se |
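The larger table plugs into the same recognizer loop; only the data changes. The sketch below stores just the non-error entries and treats every missing entry as a move to se. The names are illustrative, and the set of accepting states (s2 through s6) is an assumption inferred from the regular expression, since the table itself does not mark them.

```python
def column(ch):
    """Map an input character to its column in the transition table."""
    if ch == "r":
        return "r"
    if ch in ("0", "1"):
        return "0,1"
    if ch in ("2", "3"):
        return ch
    if ch in ("4", "5", "6", "7", "8", "9"):
        return "4-9"
    return "other"


# Only non-error transitions are listed; anything else goes to se.
DELTA = {
    ("s0", "r"): "s1",
    ("s1", "0,1"): "s2", ("s1", "2"): "s2", ("s1", "3"): "s5", ("s1", "4-9"): "s4",
    ("s2", "0,1"): "s3", ("s2", "2"): "s3", ("s2", "3"): "s3", ("s2", "4-9"): "s3",
    ("s5", "0,1"): "s6",
}
FINAL = {"s2", "s3", "s4", "s5", "s6"}   # assumed: inferred from the RE


def recognize(text):
    state = "s0"
    for ch in text:
        state = DELTA.get((state, column(ch)), "se")
    return state in FINAL


for word in ["r0", "r09", "r29", "r31", "r32", "r100"]:
    print(word, recognize(word))
```

The loop is identical in shape to the one for the simpler register recognizer, which is the point of the slide: more states, same cost per transition.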

Automating Scanner Construction

- RE → NFA (Thompson’s construction)
  - Build an NFA for each term
  - Combine them with ε-moves
- NFA → DFA (subset construction)
  - Build the simulation
- DFA → Minimal DFA
  - Hopcroft’s algorithm
- DFA → RE (not part of the scanner construction)
  - All pairs, all paths problem
  - Take the union of all paths from s0 to an accepting state
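Of the steps listed above, the NFA → DFA step is compact enough to sketch. The code below is a minimal version of the subset construction, assuming the NFA is given as a dictionary from (state, symbol) to a set of states, with `None` as the symbol for ε-moves. The example NFA (strings over {a, b} ending in ab, with one ε-move from the start state) is written by hand for illustration and is not the output of Thompson's construction.

```python
from collections import deque


def eps_closure(states, delta):
    """All NFA states reachable from `states` using only ε-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for nxt in delta.get((q, None), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def subset_construction(alphabet, delta, start, accepting):
    """Build a DFA whose states are sets of NFA states."""
    d_start = eps_closure({start}, delta)
    d_delta = {}
    seen = {d_start}
    work = deque([d_start])
    while work:
        current = work.popleft()
        for a in alphabet:
            moved = set()
            for q in current:
                moved |= delta.get((q, a), set())
            target = eps_closure(moved, delta)   # the empty set acts as the error state
            d_delta[(current, a)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    d_accepting = {s for s in seen if s & accepting}
    return seen, d_delta, d_start, d_accepting


def dfa_accepts(d_delta, d_start, d_accepting, word):
    state = d_start
    for ch in word:                 # assumes every character is in the alphabet
        state = d_delta[(state, ch)]
    return state in d_accepting


# Hand-written NFA for strings over {a, b} that end in "ab"
nfa_delta = {
    (0, None): {1},
    (1, "a"): {1, 2},
    (1, "b"): {1},
    (2, "b"): {3},
}
states, d_delta, d_start, d_accepting = subset_construction("ab", nfa_delta, 0, {3})
print(len(states))
for word in ["ab", "bab", "abb", ""]:
    print(repr(word), dfa_accepts(d_delta, d_start, d_accepting, word))
```

Each DFA state is a frozenset of NFA states, which is what "build the simulation" amounts to: the DFA tracks every NFA state the input could have reached so far.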
