Regular Expressions and Languages: A regular expression is a notation for representing a language, i.e. a set of strings, where the set is either finite or contains strings that are generated by simple recursive rules. The languages represented by regular expressions are called regular languages. Some examples of regular languages are given next, before the precise definition.
Example. The set of integer constants used in a typical programming language: an integer constant is a sequence of one or more decimal digits, up to some system-dependent maximum length.
Example. Variable names: a variable name starts with a letter, followed by zero or more letters, digits, or certain punctuation symbols (such as the underscore “_” or the dash “-”), up to a certain maximum length.

Regular languages: More precisely, let A be an alphabet (i.e., a finite, non-empty set of symbols). The collection of regular languages over A is defined by the following (recursive) rules:
(i. Base step) The empty set ∅, the set {ε} containing only the empty string ε, and {a} for every a ∈ A are regular languages;
(ii. Recursive step) If X and Y are regular languages, then the sets X ∪ Y (union), X · Y (concatenation), and X* (Kleene star) are regular languages;
(iii. Closure) No other set is a regular language unless it is the result of applying the base step followed by zero or more recursive steps.
Note that the above rules define not just one regular language; they define an infinite collection of languages (over an alphabet) and call each of them a regular language over that alphabet.

Example. Let A = {a, b} be an alphabet. The following are some examples of regular languages over A:
(a) ∅, {ε}, {a}, {b} (Rule (i));
(b) {ab, aa, aba} = {a}{b} ∪ {a}{a} ∪ {a}{b}{a} (Rule (i), and Rule (ii) for concatenation and union);
(c) {a^n | n ≥ 0} = {a}* (Rule (i), and Rule (ii) for the Kleene star operation);
(d) {a^n b^2 | n ≥ 0} = {a}*{b}{b} (Rule (i), and Rule (ii) for star and concatenation);
(e) {a^n b^2 | n ≥ 0} ∪ {b^m a^2 | m ≥ 0} (the result of (d) and Rule (ii) for union).
Note that any finite set of strings is a regular language, as (a) and (b) demonstrate. Also note the use of parentheses when necessary, e.g., ({a}* ∪ {b})*.
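The three operations of Rule (ii) can be tried out on small sets. The following is a minimal Python sketch, not part of the original slides: the function names `concat` and `star` are illustrative, and because X* is usually infinite, `star` only returns the strings of X* up to an assumed length bound `max_len`.

```python
def concat(X, Y):
    """Concatenation X · Y: every string of X followed by every string of Y."""
    return {x + y for x in X for y in Y}


def star(X, max_len):
    """Strings of X* whose length is at most max_len (X* itself may be infinite)."""
    result = {""}      # the empty string is always in X*
    frontier = {""}    # strings found in the previous round
    while frontier:
        # Append one more string of X, keep only new strings within the bound.
        frontier = {w for w in concat(frontier, X)
                    if len(w) <= max_len and w not in result}
        result |= frontier
    return result


a, b = {"a"}, {"b"}

# Example (b): {a}{b} ∪ {a}{a} ∪ {a}{b}{a}; Python's | on sets is union.
example_b = concat(a, b) | concat(a, a) | concat(concat(a, b), a)

# Example (d), restricted to n <= 2: {a}*{b}{b}
example_d = concat(concat(star(a, 2), b), b)

print(sorted(example_b))
print(sorted(example_d))
```

Bounding the star is only a device for inspecting a finite part of the language; the definition itself places no such bound.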

Regular expressions: To simplify the notation for regular languages, and by analogy with arithmetic expressions in algebra, we can replace the union symbol ∪ with the plus sign +, drop the braces “{” and “}” for sets, and use parentheses “(” and “)” for grouping when necessary.
Example. The following are some regular expressions over the alphabet A = {a, b}, and the corresponding regular languages:
Expression            Language
a + ab                {a, ab}
(a + b)*bb            {a, b}*{bb}
ba(a + b)*ab + ε      {ba}{a, b}*{ab} ∪ {ε}

More precisely, we define regular expressions by the following rules, which give both the notation and the set it represents. Let A be an alphabet.
(Basis) The “constants” ε, ∅, and a are regular expressions for each a in the alphabet A. The languages they represent are, respectively, L(ε) = {ε}, L(∅) = ∅, and L(a) = {a}.
(Recursion) If E and F are regular expressions, then E + F, EF, and E* are regular expressions. The languages they represent are, respectively, L(E + F) = L(E) ∪ L(F), L(EF) = L(E) · L(F) (concatenation), and L(E*) = (L(E))*.
(Closure) No other notation is a regular expression unless it is constructed by applying the basis step followed by zero or more recursion steps.
Note that parentheses are used for grouping only. Thus, (01)* denotes the set {ε, 01, 0101, …}, but 01* denotes the set {0, 01, 011, …}. In general, L((E)) = L(E).

Some examples of regular expressions are as follows:
(a) The language of integer constants as in C: (0+1+2+3+4+5+6+7+8+9)(0+1+2+3+4+5+6+7+8+9)*, assuming we place no limit on the maximum length.
(b) The set of strings over {a, b} that contain the substring aa: (a + b)*aa(a + b)*
(c) The set of strings over {a, b} that contain exactly two occurrences of the symbol a: b*ab*ab*
(d) The set of strings over {a, b} that contain at most 2 symbols: ε + a + b + aa + ab + ba + bb
(e) The set of strings over {a, b} that begin with a and have an even number of b's: a(a*ba*b)*a*
(f) The set of strings over {a, b} that do not contain the substring aa: b*(abb*)*(ε + a)
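Examples (b) through (f) can be checked by brute force. The sketch below is an illustration rather than part of the slides: it assumes Python's `re` module as the matcher, translates the slide notation by writing `|` for + and an empty alternative for ε, and compares each pattern against a direct test of the stated property on every string over {a, b} up to an assumed length bound of 8. The names `CASES`, `all_strings`, and `check` are hypothetical.

```python
import re
from itertools import product


def all_strings(alphabet, max_len):
    """Yield every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)


# description -> (pattern in Python syntax, direct test of the property)
CASES = {
    "(b) contains aa": (r"(a|b)*aa(a|b)*", lambda w: "aa" in w),
    "(c) exactly two a's": (r"b*ab*ab*", lambda w: w.count("a") == 2),
    "(d) at most 2 symbols": (r"|a|b|aa|ab|ba|bb", lambda w: len(w) <= 2),
    "(e) begins with a, even number of b's":
        (r"a(a*ba*b)*a*",
         lambda w: w.startswith("a") and w.count("b") % 2 == 0),
    "(f) does not contain aa": (r"b*(abb*)*(|a)", lambda w: "aa" not in w),
}


def check(pattern, has_property, max_len=8):
    """True if the pattern and the property agree on all strings up to max_len."""
    return all((re.fullmatch(pattern, w) is not None) == has_property(w)
               for w in all_strings("ab", max_len))


for name, (pattern, has_property) in CASES.items():
    print(name, check(pattern, has_property))
```

Agreement up to a length bound is evidence, not a proof; it is mainly useful for catching an expression that misses a corner case such as the empty string.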

Laws and rules for manipulating regular expressions: Let L, M, and N denote regular expressions.
(1) (Associative law) L(MN) = (LM)N. (This is true for any sets L, M, and N of strings.)
(2) (Distributive laws of concatenation over union) L(M + N) = LM + LN and (M + N)L = ML + NL.
(3) (Idempotent law) L + L = L. (This is the idempotent law for set union.)
(4) (L*)* = L*. (Both sides contain all possible strings that are made up of strings of L.)
(5) ∅* = ε* = ε. (The Kleene star always contains the empty string ε.)
(6) (L*)L = L(L*). (Both sides equal L ∪ L^2 ∪ L^3 ∪ …, which is denoted L^+.)
(7) (L*M*)* = (L + M)*.
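The laws can be sanity-checked on concrete languages. The following Python sketch is an illustration under stated assumptions: languages are finite Python sets, the Kleene star is cut off at an assumed bound `N_MAX`, and the sample languages `L`, `M`, `N` are arbitrary choices. Laws that involve the star are compared only on strings of length at most `N_MAX`.

```python
def concat(X, Y):
    """Concatenation of two languages given as sets of strings."""
    return {x + y for x in X for y in Y}


def truncate(X, n):
    """Keep only the strings of length at most n."""
    return {w for w in X if len(w) <= n}


def star(X, n):
    """Strings of X* of length at most n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = truncate(concat(frontier, X), n) - result
        result |= frontier
    return result


N_MAX = 6                                  # assumed length bound
L, M, N = {"a", "ab"}, {"b"}, {"", "ba"}   # arbitrary sample languages

laws = {
    "(1) L(MN) = (LM)N":
        concat(L, concat(M, N)) == concat(concat(L, M), N),
    "(2) L(M+N) = LM + LN":
        concat(L, M | N) == concat(L, M) | concat(L, N),
    "(2) (M+N)L = ML + NL":
        concat(M | N, L) == concat(M, L) | concat(N, L),
    "(3) L + L = L":
        L | L == L,
    "(4) (L*)* = L*":
        star(star(L, N_MAX), N_MAX) == star(L, N_MAX),
    "(5) star of the empty set and of {empty string}":
        star(set(), N_MAX) == star({""}, N_MAX) == {""},
    "(6) L*L = LL*":
        truncate(concat(star(L, N_MAX), L), N_MAX)
        == truncate(concat(L, star(L, N_MAX)), N_MAX),
    "(7) (L*M*)* = (L+M)*":
        star(concat(star(L, N_MAX), star(M, N_MAX)), N_MAX)
        == star(L | M, N_MAX),
}

for name, holds in laws.items():
    print(name, holds)
```

A check like this cannot prove a law, but a single failing sample is enough to refute a proposed identity.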

Finite Automata and Regular Languages: Finite automata (DFA and NFA) and regular languages are equivalent in the sense that every regular language can be recognized (accepted) by a finite automaton and, conversely, for every finite automaton there is a regular language that is exactly the language it accepts. Since DFAs and NFAs are equivalent, we can prove their equivalence to regular languages in two parts:
(1) Let M be a DFA, and let L = L(M) be the language accepted by M. Prove there is a regular expression R such that L(R) = L.
(2) Let L = L(R) be the regular language of an expression R. Prove there is an NFA M such that L(M) = L, where L(M) denotes the language accepted by M.

We will prove the second part first since it is easier. Specifically, we will use the following recursive rules to construct an NFA for each regular expression (for a fixed alphabet A). Since regular expressions are generated by recursive rules, it suffices to show that, for the base step and for each case of the recursive step, we can construct an NFA that accepts exactly the language of the expression being constructed.
(Base step) The following NFAs correspond to the regular expressions ε, ∅, and a, respectively, for each a ∈ A. It is easy to verify that these NFAs are correct. Note that in each case we construct an NFA with exactly one start state, one final (accepting) state, no arcs into the start state, and no arcs out of the final state; we call this the desirable property (for lack of a better term).
[Figure: NFAs for ε, ∅, and a]

(Recursive step) Suppose E and F are regular expressions whose equivalent NFAs M and N have already been constructed, and both have the desirable property. The following diagrams show how to construct the NFAs corresponding to, respectively, the expressions E + F, EF, and E*. Note that in each case we add one or more ε-transitions; the resulting NFAs also have the desirable property.
[Figure: NFA for E + F, NFA for EF, and NFA for E*, each built from M and N with ε-transitions]

Example. An NFA for the regular expression (0+1)*1(0+1). We follow (roughly) the steps (a) through (e) below. Note that the NFA of every step satisfies the desirable property.
[Figure: (a) NFA for 0; (b) NFA for 1; (c) NFA for 0+1; (d) NFA for (0+1)*; (e) NFA for (0+1)*1(0+1)]
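The base and recursive steps can be sketched in code. The following Python sketch is illustrative only: the tuple encoding of expressions, the `NFA` class, and the function names are assumptions, and the ε-wiring is the usual Thompson-style choice, which may differ in detail from the diagrams. Every fragment it returns has one start state with no incoming arcs and one final state with no outgoing arcs, i.e. the desirable property.

```python
from collections import defaultdict
from itertools import count, product


class NFA:
    """Transition table; the label None stands for an ε-transition."""

    def __init__(self):
        self.delta = defaultdict(set)   # (state, label) -> set of next states
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add(self, src, label, dst):
        self.delta[(src, label)].add(dst)


def build(nfa, e):
    """Return (start, final) of a fragment accepting the language of e."""
    kind = e[0]
    if kind in ("eps", "empty", "sym"):          # base step
        s, f = nfa.new_state(), nfa.new_state()
        if kind == "eps":
            nfa.add(s, None, f)
        elif kind == "sym":
            nfa.add(s, e[1], f)
        return s, f                              # "empty": no arcs at all
    if kind == "cat":                            # EF: link final of E to start of F
        s1, f1 = build(nfa, e[1])
        s2, f2 = build(nfa, e[2])
        nfa.add(f1, None, s2)
        return s1, f2
    s, f = nfa.new_state(), nfa.new_state()      # fresh start and final states
    if kind == "union":                          # E + F: branch into either part
        for part in (e[1], e[2]):
            s1, f1 = build(nfa, part)
            nfa.add(s, None, s1)
            nfa.add(f1, None, f)
    elif kind == "star":                         # E*: skip, or loop through E
        s1, f1 = build(nfa, e[1])
        nfa.add(s, None, s1)
        nfa.add(s, None, f)
        nfa.add(f1, None, s1)
        nfa.add(f1, None, f)
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    return s, f


def eps_closure(nfa, states):
    """All states reachable from the given states by ε-transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in nfa.delta.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(nfa, start, final, w):
    """Simulate the NFA on w by tracking the set of possible current states."""
    current = eps_closure(nfa, {start})
    for ch in w:
        moved = set()
        for q in current:
            moved |= nfa.delta.get((q, ch), set())
        current = eps_closure(nfa, moved)
    return final in current


# The example expression (0+1)*1(0+1)
zero_or_one = ("union", ("sym", "0"), ("sym", "1"))
expr = ("cat", ("cat", ("star", zero_or_one), ("sym", "1")), zero_or_one)

nfa = NFA()
start, final = build(nfa, expr)
for w in ("10", "0110", "1", "100"):
    print(w, accepts(nfa, start, final, w))
```

The expression describes the binary strings whose second-to-last symbol is 1, which gives a simple independent check of the constructed NFA.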

Construction of a regular expression equivalent to a DFA: Let A be a DFA with states labeled 1, 2, …, n. We assume state 1 is the (only) start state. The idea of the construction (or proof) is to demonstrate that for any two states i and j, we can construct a regular expression R_ij that describes exactly those strings that are the labels of paths from node i to node j; thus, string w belongs to R_ij if (i, w) ⊢* (j, ε). We prove this assertion by recursive construction. First, define R^(k)_ij as the set of strings that are the labels of paths from node i to node j passing through only states ≤ k (i.e., each intermediate state of the path must have a label ≤ k).
[Figure: a path from i to j labeled R^(k)_ij, with all intermediate state labels ≤ k]

We now show how to use regular expressions to represent the sets R^(k)_ij, by induction on k.
(Basis) When k = 0. That is, consider the labels of the paths that connect state i to state j without intermediate states, i.e. a direct arc (edge) from i to j, if one exists. There are two cases:
(i) When i ≠ j. R^(0)_ij = ∅ if there is no arc (transition) from state i to state j; otherwise R^(0)_ij = a_1 + a_2 + … + a_m if there are transitions from state i to state j labeled a_1, a_2, …, a_m.
(ii) When i = j. R^(0)_ij = ε if there is no arc (transition) from state i to itself; otherwise R^(0)_ij = ε + a_1 + a_2 + … + a_m if there are transitions from state i to itself labeled a_1, a_2, …, a_m.
[Figure: arcs labeled a_1, …, a_m from state i to state j, and self-loops labeled a_1, …, a_m on state i]

(Recursion) When k ≥ 1. We can express R^(k)_ij in terms of expressions with a smaller superscript. Specifically,
R^(k)_ij = R^(k−1)_ij + R^(k−1)_ik (R^(k−1)_kk)* R^(k−1)_kj
This is true because each path in the set R^(k)_ij either avoids state k or passes through it one or more times. In the former case, those paths constitute the expression R^(k−1)_ij; in the latter case, the subpaths that lead to state k for the first time are represented by R^(k−1)_ik, followed by paths from state k to itself (zero or more times) represented by (R^(k−1)_kk)*, and finally continued with paths from state k to state j represented by R^(k−1)_kj. The following diagram illustrates the idea:
[Figure: a path from i to the first visit of k labeled R^(k−1)_ik, loops from k back to k labeled (R^(k−1)_kk)*, and a final segment from k to j labeled R^(k−1)_kj]

Notice that these R^(k)_ij are all regular expressions, because the base step (k = 0) starts with regular expressions and each application of the recursive rule uses only regular expression operations (i.e., +, *, and concatenation). To complete the proof that a DFA M can be converted to an equivalent regular expression, we construct R^(n)_1j for each accepting state j. (Recall that the states are labeled 1 through n, and state 1 is the start state.) Then L(M) is the sum (i.e. the union) of these R^(n)_1j's, where j ranges over all accepting states of M.
Example (p. 94 of the Text). Convert the following DFA to an equivalent regular expression.
[Figure: DFA with states 1 and 2; state 1 is the start state, with a loop on 1 and a transition to state 2 on 0; state 2 is the accepting state, with a loop on 0 and 1]

We first apply the basis step (k = 0) and construct the following sets: R^(0)_11 = ε + 1; R^(0)_12 = 0; R^(0)_21 = ∅; and R^(0)_22 = ε + 0 + 1; each corresponds to the single-arc transitions from one state to another state. Using the recursive rule, we can now construct the R^(k)_ij's with k = 1:
            Recursive rule                      Simplified
R^(1)_11    ε + 1 + (ε + 1)(ε + 1)*(ε + 1)      1*
R^(1)_12    0 + (ε + 1)(ε + 1)*0                1*0
R^(1)_21    ∅ + ∅(ε + 1)*(ε + 1)                ∅
R^(1)_22    ε + 0 + 1 + ∅(ε + 1)*0              ε + 0 + 1
Note that laws such as ∅L = ∅ and (ε + L)* = L* are used during simplification. Since there is only one accepting state (state 2), we only need to construct R^(k)_12 for k = 2. Thus, the equivalent regular expression is
R^(2)_12 = R^(1)_12 + R^(1)_12 (R^(1)_22)* R^(1)_22 = 1*0 + 1*0(ε + 0 + 1)*(ε + 0 + 1) = 1*0(0+1)* after simplification.
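The basis and recursive rule for R^(k)_ij translate directly into a short program. The sketch below is an illustration under stated assumptions, not the Text's own procedure: it emits patterns in Python `re` syntax, represents ∅ as `None` and ε as the empty pattern, applies only the simplifications ∅ + L = L, ∅L = ∅, ∅* = ε* = ε, and L + L = L, and uses the hypothetical names `union`, `concat`, `star`, and `dfa_to_regex`. The DFA is the two-state example, encoded as a transition dictionary.

```python
import re
from itertools import product


def union(r, s):
    """r + s, with None standing for the empty set and "" for ε."""
    if r is None:
        return s
    if s is None:
        return r
    if r == s:
        return r
    if r == "":
        return f"(?:{s})?"      # ε + s
    if s == "":
        return f"(?:{r})?"      # r + ε
    return f"(?:{r}|{s})"


def concat(r, s):
    """rs; concatenating with the empty set gives the empty set."""
    if r is None or s is None:
        return None
    return r + s


def star(r):
    """r*; the star of the empty set or of ε is ε."""
    if r is None or r == "":
        return ""
    return f"(?:{r})*"


def dfa_to_regex(n, delta, accepting, start=1):
    """Pattern for a DFA with states 1..n, or None if it accepts nothing."""
    # Basis: R[i, j] is R^(0)_ij, built from the direct arcs i -> j.
    R = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            r = "" if i == j else None
            for (p, a), q in delta.items():
                if p == i and q == j:
                    r = union(r, re.escape(a))
            R[i, j] = r
    # Recursion: R^(k)_ij = R^(k-1)_ij + R^(k-1)_ik (R^(k-1)_kk)* R^(k-1)_kj
    for k in range(1, n + 1):
        R = {(i, j): union(R[i, j],
                           concat(concat(R[i, k], star(R[k, k])), R[k, j]))
             for i in range(1, n + 1) for j in range(1, n + 1)}
    # Union of R^(n)_{start, j} over all accepting states j.
    result = None
    for j in sorted(accepting):
        result = union(result, R[start, j])
    return result


# The example DFA: state 1 loops on 1 and moves to 2 on 0; state 2 loops on 0, 1.
delta = {(1, "1"): 1, (1, "0"): 2, (2, "0"): 2, (2, "1"): 2}
pattern = dfa_to_regex(2, delta, accepting={2})
print(pattern)
print(re.fullmatch(pattern, "1101") is not None)
```

Because so few simplification laws are applied, the printed pattern is much longer than 1*0(0+1)*, but it should describe the same language: the binary strings that contain at least one 0.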
