# 79 Regular Expression Regular expressions over an alphabet $\Sigma$ are defined recursively as follows. (1) Ø, which denotes the empty set, is a regular expression.


79 Regular Expression

While studying formal languages, we often express languages in set notation, such as $\{a^i b^i \mid i > 0\}$. Such set notation is practical only when the property that defines the language is simple enough to describe. All regular languages, however, can be expressed succinctly by regular expressions, which are defined as follows.

Regular expressions over an alphabet $\Sigma$ are defined recursively as follows.

(1) Ø, which denotes the empty set, is a regular expression.

(2) $\varepsilon$ is a regular expression and denotes the set $\{\varepsilon\}$.

(3) For every $a \in \Sigma$, $a$ is a regular expression and denotes the set $\{a\}$.

(4) If $r$ and $s$ are regular expressions that denote the sets $R$ and $S$, respectively, then $(r + s)$, $(rs)$, and $(r^*)$ are regular expressions that denote, respectively, the sets $R \cup S$, $RS$, and $R^*$.

We may omit parentheses from a regular expression if it expresses the same set under the assumption that the star has higher precedence than concatenation and +, and that concatenation has higher precedence than +. For a regular expression $r$, we write $L(r)$ for the set of strings expressed by $r$.
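The recursive definition can be read almost directly as a program. The following is a minimal sketch, not part of the original slides: it represents a regular expression as a nested tuple (the tags `"empty"`, `"eps"`, `"sym"`, `"union"`, `"concat"`, and `"star"` are names assumed here for illustration) and computes the strings of $L(r)$ up to a length bound, since $L(r)$ itself may be infinite.

```python
def language(r, max_len):
    """Return the strings of length <= max_len denoted by regular expression r."""
    kind = r[0]
    if kind == "empty":      # rule (1): the empty set
        return set()
    if kind == "eps":        # rule (2): the set containing only the empty string
        return {""}
    if kind == "sym":        # rule (3): a single symbol a denotes {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":      # rule (4): (r + s) denotes R ∪ S
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "concat":     # rule (4): (rs) denotes RS
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {x + y for x in left for y in right
                if len(x) + len(y) <= max_len}
    if kind == "star":       # rule (4): (r*) denotes R*
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            # Extend the newest strings by one more factor from R.
            frontier = {x + y for x in frontier for y in base
                        if len(x) + len(y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind: {kind!r}")


# a*b* written as a tuple tree (hypothetical encoding).
a_star_b_star = ("concat", ("star", ("sym", "a")), ("star", ("sym", "b")))
print(sorted(language(a_star_b_star, 2)))
```

The length bound is only a device for making the sets finite; the definition on the slide places no such bound.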

80 Regular expression (cont’ed)

For example, the regular language $\{a^i b^j \mid i, j \ge 0\}$ can be expressed by the regular expression $a^*b^*$, and the regular language $\{xaaybbz \mid x, y, z \in \{a, b\}^*\}$ can be expressed as $(a+b)^*aa(a+b)^*bb(a+b)^*$.

We can easily prove that $a^*b^*$ is a regular expression according to the definition. Since $a$ and $b$ are regular expressions denoting the sets $\{a\}$ and $\{b\}$, respectively, by part (4) of the definition $a^*$ and $b^*$ are regular expressions, which denote the sets $\{a\}^*$ and $\{b\}^*$, respectively. Since $a^*$ and $b^*$ are regular expressions, the concatenation $a^*b^*$ is also a regular expression by part (4), and it denotes the set $\{a\}^*\{b\}^*$, which is equal to $\{a^i b^j \mid i, j \ge 0\}$. By a similar argument we can show that $(a+b)^*aa(a+b)^*bb(a+b)^*$ is a regular expression which denotes the second language above.

Later we will see that every regular language can be expressed by a regular expression, and that if a language is expressible by a regular expression, then that language is regular.
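The two examples can be checked mechanically on short strings. The sketch below is an illustration, not part of the slides: it uses Python's `re` module, where union is written `|` rather than +, and compares each expression with a direct test of the set definition. The helper names are hypothetical.

```python
import re
from itertools import product

A_STAR_B_STAR = re.compile(r"a*b*")
AA_THEN_BB = re.compile(r"(a|b)*aa(a|b)*bb(a|b)*")


def strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def in_first_set(s):
    # {a^i b^j | i, j >= 0} over {a, b}: no a may follow a b.
    return "ba" not in s


def in_second_set(s):
    # {x aa y bb z}: some "aa" is followed, without overlap, by some "bb".
    i = s.find("aa")
    return i != -1 and s.find("bb", i + 2) != -1


def agree(max_len=8):
    """True if both expressions match exactly the strings in their sets."""
    for s in strings("ab", max_len):
        if (A_STAR_B_STAR.fullmatch(s) is not None) != in_first_set(s):
            return False
        if (AA_THEN_BB.fullmatch(s) is not None) != in_second_set(s):
            return False
    return True
```

Agreement on all strings up to a fixed length is evidence, not a proof; the proof is the inductive argument given on the slide.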

81 Chomsky Hierarchy of Languages and Related Models

We have studied four types of formal grammars and their languages, and four computational models that recognize those languages, together with other related models such as L-systems, syntax flow graphs, and regular expressions. We now examine their relationships more closely. The table on the next page summarizes the relationships among these models. This relationship, called the Chomsky hierarchy (after Noam Chomsky, who defined the classes of languages), is one of the most significant achievements in computer science.

In the table, the vertical relationship denotes proper containment and the horizontal relationship denotes characterization. For example, the class of context-free languages properly contains the regular languages, finite state machines recognize only regular languages, and the languages recognized by finite state machines can be expressed by regular expressions. Many powerful models have been introduced (for example, the ones shown at the upper right corner of the table) that turned out to be computationally equivalent to Turing machines, whose languages are also called recursively enumerable sets.

82 The Chomsky Hierarchy

| Languages (grammars) | Machines | Other Models |
|---|---|---|
| Recursively Enumerable Sets (type 0) | Turing Machines | Post Systems, Markov Algorithms, $\mu$-recursive Functions |
| Context-sensitive Languages (type 1) | Linear-bounded Automata | ..... |
| Context-free Languages (type 2) | Pushdown Automata | ..... |
| Regular Languages (type 3) | Finite State Automata | Regular Expression |

83 Characterization Theorem among Regular Grammars, FA’s and Regular Expressions

We prove the characterization (i.e., the horizontal relationship) only at the level of regular languages, and later prove the vertical relations for the lower two levels only.

Theorem.

(1) A language L is regular if and only if it is accepted by an FA M.

(2) A language L can be expressed by a regular expression if and only if L is accepted by an FA M.

Proof of (1-a): If L is regular, then there is an FA M which accepts L.

We construct an FA M from any regular grammar G whose language is L. Without loss of generality, assume that every production rule of G has the form $A \to xB$ or $A \to x$, where $x$ is $\varepsilon$ or a single terminal symbol, i.e., $|x| \le 1$. Otherwise, we can easily convert the rules into these restricted forms without affecting the language of the grammar. For example, the rule $A \to abbB$ can be converted, without changing the language, into the following set of rules, where the $B_i$ are new nonterminal symbols:

$A \to aB_1$

$B_1 \to bB_2$

$B_2 \to bB$
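The rule-splitting step can be sketched as a small function. This is an illustrative sketch under assumptions not in the slides: a rule is a triple `(lhs, terminals, target)` with `target=None` when the rule ends in terminals only, and fresh nonterminals are named `B1`, `B2`, ... as in the example above. In a full grammar the fresh names would also have to avoid collisions with existing nonterminals.

```python
def split_rule(lhs, terminals, target=None, fresh_prefix="B"):
    """Split lhs -> terminals target into rules with at most one terminal each.

    Each returned triple (A, x, C) stands for the rule A -> x C, or A -> x
    when C is None. The fresh_prefix naming scheme is a hypothetical choice.
    """
    if len(terminals) <= 1:
        return [(lhs, terminals, target)]  # already in the restricted form
    rules = []
    current = lhs
    # Every terminal except the last gets its own new nonterminal.
    for k, symbol in enumerate(terminals[:-1], start=1):
        fresh = f"{fresh_prefix}{k}"
        rules.append((current, symbol, fresh))
        current = fresh
    # The last terminal keeps the original right-hand nonterminal, if any.
    rules.append((current, terminals[-1], target))
    return rules


for rule in split_rule("A", "abb", "B"):
    print(rule)
```

Applied to $A \to abbB$, the function yields the three rules listed on the slide.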

84 Proof of Characterization Theorem (cont’ed)

Suppose the grammar is given as $G = (V_T, V_N, P, S)$. We construct an FA M from G using the rules shown below. Let $A, B \in V_N$ and $a \in V_T \cup \{\varepsilon\}$.

| For each production rule of the following type | Construct a state transition in M as follows |
|---|---|
| $A \to aB$ | a transition labeled $a$ from state A to state B |
| $A \to aA$ | a self-loop labeled $a$ on state A |
| $A \to a$ | a transition labeled $a$ from state A to F, where F is a new accepting state |
| $A \to \varepsilon$ | A is an accepting state |
| A is the start symbol | let A be the start state |

We can prove that $L(G) = L(M)$, i.e., the language accepted by M is exactly the language generated by the grammar G.
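The construction rules can be sketched in code. This is a minimal illustration under stated assumptions: rules are triples `(A, x, B)` with `x` equal to `""` for $\varepsilon$ and `B` equal to `None` when the rule has no nonterminal on the right, the new accepting state is named `"F"`, and the example grammar $S \to aS \mid bA$, $A \to bA \mid a \mid \varepsilon$ is hypothetical rather than taken from the slides.

```python
def grammar_to_fa(rules, start, final="F"):
    """Build (delta, start, accepting) from rules in the restricted form."""
    delta, accepting = {}, set()
    for lhs, x, rhs in rules:
        if rhs is None and x == "":
            accepting.add(lhs)                       # A -> ε: A accepts
        elif rhs is None:
            delta.setdefault((lhs, x), set()).add(final)
            accepting.add(final)                     # A -> a: go to new state F
        else:
            delta.setdefault((lhs, x), set()).add(rhs)  # A -> aB (or A -> aA)
    return delta, start, accepting


def closure(states, delta):
    """All states reachable from states using ε-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(fa, s):
    """Simulate the (possibly nondeterministic) FA on string s."""
    delta, start, accepting = fa
    current = closure({start}, delta)
    for ch in s:
        moved = set()
        for q in current:
            moved |= delta.get((q, ch), set())
        current = closure(moved, delta)
    return bool(current & accepting)


# Hypothetical grammar: S -> aS | bA, A -> bA | a | ε.
RULES = [("S", "a", "S"), ("S", "b", "A"),
         ("A", "b", "A"), ("A", "a", None), ("A", "", None)]
FA = grammar_to_fa(RULES, "S")
print(accepts(FA, "abba"), accepts(FA, "bab"))
```

The resulting machine is in general nondeterministic, which is why the simulation tracks a set of current states.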

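85 Proof of Characterization Theorem (cont’ed)

Proof of (1-b): If L is the language accepted by an FA M, then there is a regular grammar G which generates L.

Let $M = (Q, \Sigma, \delta, q_0, F)$. Construct a regular grammar G from M according to the rules shown below, where $A, B \in Q$ and $a, b \in \Sigma \cup \{\varepsilon\}$.

| State transition of M | Production rule of G |
|---|---|
| a transition labeled $b$ from state A to state B | $A \to bB$ |
| a self-loop labeled $a$ on state A | $A \to aA$ |
| A is the start state | define A as the start symbol |
| A is an accepting state | $A \to \varepsilon$ |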
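The reverse construction is equally mechanical. The sketch below assumes the FA is given as a dictionary mapping `(state, symbol)` to a set of target states, with `""` standing for $\varepsilon$; this representation and the function name are illustrative choices, not taken from the slides.

```python
def fa_to_grammar(delta, start, accepting):
    """Return (start_symbol, rules) for the regular grammar of the FA.

    Each rule is a triple (A, x, B) meaning A -> xB, or (A, "", None)
    meaning A -> ε. State names double as nonterminal names.
    """
    rules = []
    for state, symbol in sorted(delta):
        for target in sorted(delta[(state, symbol)]):
            rules.append((state, symbol, target))    # transition => A -> xB
    for state in sorted(accepting):
        rules.append((state, "", None))              # accepting  => A -> ε
    return start, rules


# Hypothetical FA: state A loops on a, moves to B on b, and A is accepting.
start_symbol, rules = fa_to_grammar(
    {("A", "a"): {"A"}, ("A", "b"): {"B"}}, start="A", accepting={"A"})
for lhs, x, rhs in rules:
    print(f"{lhs} -> {x or 'ε'}{rhs or ''}")
```

Sorting the states only makes the output order deterministic; the set of rules is what matters.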

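86 Characterization Theorem (examples)

Example 1 (regular grammar → FA):

$S \to aS \mid bbcB$

$B \to bA \mid a$

$A \to aS \mid bB \mid \varepsilon$

[Figure: state transition graph of the FA constructed from this grammar.]

Example 2 (FA → regular grammar): Name the states, then transform the FA to a grammar.

[Figure: state transition graph of an FA whose states are named S, A, B, C, D, and E.]

$S \to aS \mid aA$

$A \to bB$

$B \to bB \mid bS \mid aD \mid \varepsilon$

$D \to aC$

$C \to aB \mid cE$

$E \to \varepsilon$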

87 Proof of Characterization Theorem (cont’ed)

Proof of (2)-(a): If a language L can be expressed by a regular expression, then L is accepted by an FA M.

Following the definition of regular expressions, we show how to construct an FA for a given regular expression. (This is a proof by induction.) Assume that the alphabet is $\Sigma$.

1. If the regular expression is Ø, $\varepsilon$, or $a \in \Sigma$, which respectively denote the empty set, $\{\varepsilon\}$, and $\{a\}$, then for each case we construct the following FA.

[Figure: FAs for the three base cases Ø, $\varepsilon$, and $a$.]

2. Suppose that for regular expressions $r_1$ and $r_2$ we have constructed FAs $M_1$ and $M_2$, which recognize the languages expressed by $r_1$ and $r_2$, respectively. Then we can construct FAs $M_{1+2}$, $M_{12}$, and $M_1^*$, which respectively recognize the languages expressed by the regular expressions $r_1 + r_2$, $r_1 r_2$, and $(r_1)^*$, as follows:

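88 Proof of Characterization Theorem (cont’ed)

[Figure: $M_{1+2}$ is built from $M_1$ and $M_2$ with a new start state and $\varepsilon$-transitions; $M_{12}$ connects $M_1$ to $M_2$ with an $\varepsilon$-transition; $M_1^*$ is built from $M_1$ with a new start state and $\varepsilon$-transitions.]

If $L(M_1) = L(r_1)$ and $L(M_2) = L(r_2)$, then $L(M_{1+2}) = L(r_1 + r_2)$, $L(M_{12}) = L(r_1 r_2)$, and $L(M_1^*) = L((r_1)^*)$.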
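The inductive construction can be sketched as follows. This is one standard way to wire the $\varepsilon$-transitions and may differ in detail from the figures of the original slides, which did not survive extraction. An FA is a triple `(start, accepting, delta)` with integer states, `""` stands for $\varepsilon$, and all function names are assumptions made for illustration. Each fragment is meant to be used once, so sub-automata are built fresh wherever they are needed.

```python
from itertools import count

_fresh = count()  # source of new state numbers


def _merge(*deltas):
    out = {}
    for d in deltas:
        for key, targets in d.items():
            out.setdefault(key, set()).update(targets)
    return out


def _edge(delta, src, sym, dst):
    delta.setdefault((src, sym), set()).add(dst)


def empty():                      # base case Ø: no accepting state
    return next(_fresh), set(), {}


def epsilon():                    # base case ε: the start state accepts
    q = next(_fresh)
    return q, {q}, {}


def symbol(a):                    # base case a: one transition on a
    p, q = next(_fresh), next(_fresh)
    return p, {q}, {(p, a): {q}}


def union(m1, m2):                # new start with ε-moves into M1 and M2
    s, delta = next(_fresh), _merge(m1[2], m2[2])
    _edge(delta, s, "", m1[0])
    _edge(delta, s, "", m2[0])
    return s, m1[1] | m2[1], delta


def concat(m1, m2):               # ε-moves from M1's accepting states to M2
    delta = _merge(m1[2], m2[2])
    for f in m1[1]:
        _edge(delta, f, "", m2[0])
    return m1[0], set(m2[1]), delta


def star(m1):                     # new accepting start, ε-moves in and back
    s, delta = next(_fresh), _merge(m1[2])
    _edge(delta, s, "", m1[0])
    for f in m1[1]:
        _edge(delta, f, "", s)
    return s, {s}, delta


def accepts(m, text):
    start, accepting, delta = m

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for r in delta.get((stack.pop(), ""), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = closure({start})
    for ch in text:
        current = closure({r for q in current for r in delta.get((q, ch), ())})
    return bool(current & accepting)


a_star_b_star = concat(star(symbol("a")), star(symbol("b")))
ab_or_c_star = star(union(concat(symbol("a"), symbol("b")), symbol("c")))
```

Here `a_star_b_star` corresponds to $a^*b^*$ and `ab_or_c_star` to $(ab+c)^*$; `accepts` simulates the resulting nondeterministic FA by tracking the set of reachable states.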

89 Proof of Characterization Theorem (cont’ed)

Proof of (2)-(b): If L is a language accepted by an FA M, then L can be expressed by a regular expression.

Definition: Generalized state transition graph. If an FA M takes a transition from a state $p$ to a state $q$ on every string expressed by a regular expression $r$, we write $\delta(p, r) = q$ and draw the state transition as Figure (a) shows. Figure (b) is an example.

[Figure (a): an edge from $p$ to $q$ labeled $r$. Figure (b): an edge from $p$ to $q$ labeled $(ab+c)^*$.]

The state transition graph of M can be regarded as a special case of a generalized state transition graph, in which every edge label is a regular expression expressing one string of length 1 or 0 (the latter for an $\varepsilon$-transition). Generalizing $\delta$ further, for a path label $w = r_1 r_2 \dots r_i$ (i.e., a concatenated sequence of regular expressions), let $\delta(p, w) = q$ denote the sequence of transitions along a path from $p$ to $q$ whose edges are labeled with the regular expressions $r_1, r_2, \dots, r_i$.

90 For a generalized state transition graph G, let $L(G)$ be the set of strings defined as follows, where $q_0$ is the start state and $F$ is the set of accepting states. Clearly $L(G) = L(M)$.

$L(G) = \{x \mid x \in L(w),\ w \text{ is a path label such that } \delta(q_0, w) = q_f \in F\}$

Given a generalized state transition graph G of an FA, we can eliminate a state from G and transform it into another generalized state transition graph G' such that $L(G) = L(G')$. Suppose that $q$ is a non-accepting state in a state transition graph G. Suppose $q$ has a self-loop and is on a path between its two neighboring states $r$ and $s$, as shown in figure (a) below. (Dotted arrows indicate other possible transitions.) State $q$ can be eliminated and generalized transitions can be added without changing the language of the automaton, as figure (b) shows.

[Figure (a) G: state $q$ with a self-loop labeled $f$, incoming edges labeled $a$ and $d$, and outgoing edges labeled $b$ and $c$, connecting it to its neighbors $r$ and $s$. Figure (b) G': state $q$ removed and new edges labeled $af^*b$, $af^*c$, $df^*c$, and $df^*b$ added between $r$ and $s$.]
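One elimination step can be sketched as a function on a dictionary of edge labels. This is a minimal illustration under assumptions: a graph maps `(source, target)` to a regular expression written as a string in the slides' notation, no simplification of the resulting expressions is attempted, and the direction of the edges in the example (`a` from r to q, `b` from q to s, `c` from q to r, `d` from s to q) is a reading of the figure inferred from the labels of G'.

```python
def wrap(label):
    """Parenthesize a label unless it is a single symbol."""
    return label if len(label) == 1 else f"({label})"


def eliminate(edges, q):
    """Remove state q, rerouting every path p -> q -> s as a direct edge."""
    loop = edges.get((q, q))
    loop_star = f"{wrap(loop)}*" if loop else ""
    incoming = {p: lab for (p, t), lab in edges.items() if t == q and p != q}
    outgoing = {s: lab for (p, s), lab in edges.items() if p == q and s != q}
    # Keep every edge that does not touch q.
    result = {pair: lab for pair, lab in edges.items() if q not in pair}
    for p, lab_in in incoming.items():
        for s, lab_out in outgoing.items():
            new = f"{wrap(lab_in)}{loop_star}{wrap(lab_out)}"
            if (p, s) in result:
                # Merge with an existing parallel edge using +.
                result[(p, s)] = f"{result[(p, s)]}+{new}"
            else:
                result[(p, s)] = new
    return result


G = {("r", "q"): "a", ("q", "q"): "f", ("q", "s"): "b",
     ("q", "r"): "c", ("s", "q"): "d"}
G_PRIME = eliminate(G, "q")
print(G_PRIME)
```

Every pair of an incoming and an outgoing edge of the removed state yields one new edge, so eliminating a state with many neighbors can add many edges.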

91 Now we give an example of transforming a state transition graph G into a regular expression using the above technique. Consider an FA whose state transition graph is shown in figure (a) below. Clearly, if an automaton has $k \ge 1$ accepting states, then the language of the automaton is the union of the languages accepted by the $k$ accepting states. So we compute a regular expression $r_i$ for the language $L_i$ accepted by each of the $k$ accepting states, and obtain the regular expression for the language of the automaton as $r = r_1 + r_2 + \dots + r_k$. For example, the language accepted by the automaton shown below is the union of the languages accepted by states 0 and 4.

[Figure (a): state transition graph of the example FA, with start state 0 and states 1, 2, 3, and 4; edges are labeled $a$, $b$, or $\varepsilon$.]

92 For this example, we first compute the regular expression for the language accepted by state 4, by changing state 0 to a non-accepting state. Leaving the start state and the accepting state in place, we eliminate all other states, one at a time. Eliminating state 2 gives the generalized state transition graph shown in (b). We could have eliminated state 1 or 3 first. In general it is better to choose a state whose elimination does not induce too many new edges. Before eliminating state 3, we merge edges which have the same origin and destination using the + operator, and get figure (c) below.

[Figure (b): the graph after eliminating state 2, with states 0 (start), 1, 3, and 4. Figure (c): the same graph after merging parallel edges; the merged labels include $ba+a$ and $b+\varepsilon$.]

93 Eliminating state 3 gives the graph shown in figure (d).

[Figure (d): the graph after eliminating state 3, with states 0 (start), 1, and 4; the new edge labels include $bba$, $ba$, and $bb$.]

94 Finally, eliminating state 1, we get the graph in figure (e). Notice that the regular expression $b+bb$ on the self-loop of state 4 has been simplified to $b$, because looping on $b$ or $bb$ is equivalent to looping on $b$, i.e., $(b+bb)^* = b^*$.

[Figure (e): the graph after eliminating state 1, with states 0 (start) and 4 and edges labeled $b$, $ba$, $a(ba+a)^*b$, $b$, $a(ba+a)^*(b+\varepsilon)$, $bba(ba+a)^*(b+\varepsilon)$, and $bba(ba+a)^*b$.]

95 By merging edges which have the same origin and destination, we get the final transition graph (f), from which we can construct a regular expression $r_4$ whose language is exactly the language accepted by state 4.

[Figure (f): states 0 (start) and 4 with four edges labeled $a(ba+a)^*b+b$, $a(ba+a)^*(b+\varepsilon)$, $bba(ba+a)^*(b+\varepsilon)+b$, and $bba(ba+a)^*b+ba$.]

96 In general, suppose a generalized transition graph with a start state and an accepting state is given, with each edge labeled with a regular expression, as shown in figure (g) below. Then the regular expression $r_2$ shown in the figure expresses the language accepted by the automaton.

[Figure (g): state 1 (start) and state 2 (accepting), with self-loops $r_{11}$ and $r_{22}$, an edge $r_{12}$ from state 1 to state 2, and an edge $r_{21}$ from state 2 to state 1.]

$r_2 = (r_{11})^* r_{12} (r_{22} + r_{21} (r_{11})^* r_{12})^*$

By substituting each $r_{ij}$ in the expression of figure (g) with the corresponding regular expression from figure (f), we get the regular expression $r_4$ for the language accepted by state 4.

97 Now, to construct a regular expression for the language accepted by the other accepting state, which is the start state, we can start from figure (f), changing the start state back to an accepting state and state 4 to a non-accepting state, as shown in figure (h). This is the general case shown in figure (i), whose regular expression is given as $r_1$ in the figure. Substituting the corresponding regular expressions from figure (h), we get a regular expression $r_0$ which denotes the language accepted by state 0. Finally we get a regular expression $r = r_0 + r_4$ which denotes the language accepted by automaton M.

[Figure (h): the graph of figure (f) with state 0 as both start and accepting state and state 4 non-accepting; edge labels $a(ba+a)^*b+b$, $a(ba+a)^*(b+\varepsilon)$, $bba(ba+a)^*(b+\varepsilon)+b$, and $bba(ba+a)^*b+ba$.]

[Figure (i): state 1 (start and accepting) and state 2, with self-loops $r_{11}$ and $r_{22}$, an edge $r_{12}$ from state 1 to state 2, and an edge $r_{21}$ from state 2 to state 1.]

$r_1 = (r_{11} + r_{12} (r_{22})^* r_{21})^*$
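The two closed forms for a two-state graph, $(r_{11})^* r_{12} (r_{22} + r_{21} (r_{11})^* r_{12})^*$ when state 2 is accepting and $(r_{11} + r_{12} (r_{22})^* r_{21})^*$ when the start state 1 is accepting, can be sanity-checked on a small automaton. The sketch below is an illustration with a hypothetical example that is not from the slides: a two-state FA over {a, b} that moves between its states on a and stays put on b, so state 2 accepts strings with an odd number of a's and state 1 those with an even number. Python's `re` syntax writes union as `|`, and every edge is assumed to be present (no Ø labels).

```python
import re
from itertools import product


def accept_second(r11, r12, r21, r22):
    """Expression for start state 1, accepting state 2 (figure (g))."""
    return f"({r11})*({r12})(({r22})|({r21})({r11})*({r12}))*"


def accept_first(r11, r12, r21, r22):
    """Expression when state 1 is both start and accepting (figure (i))."""
    return f"(({r11})|({r12})({r22})*({r21}))*"


# Hypothetical parity automaton: loops on b, switches state on a.
LABELS = dict(r11="b", r12="a", r21="a", r22="b")
ODD = re.compile(accept_second(**LABELS))
EVEN = re.compile(accept_first(**LABELS))


def check(max_len=7):
    """Compare both expressions with a direct count of a's."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            s = "".join(letters)
            odd = s.count("a") % 2 == 1
            if (ODD.fullmatch(s) is not None) != odd:
                return False
            if (EVEN.fullmatch(s) is not None) != (not odd):
                return False
    return True


print(check())
```

Since every string ends in exactly one of the two states, the union of the two expressions covers all strings over {a, b}, mirroring $r = r_0 + r_4$ in the example.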
