Specification of tokens using regular expressions

Strings and Languages
An alphabet is any finite set of symbols. Examples of symbols are letters, digits, and punctuation. The set {0,1} is the binary alphabet. A string over an alphabet is a finite sequence of symbols drawn from that alphabet. "Sentence" and "word" are often used as synonyms for "string." The empty string, denoted ε, is the string of length zero. A language is any countable set of strings over some fixed alphabet. The empty set ∅ and the set {ε}, which contains only the empty string, are both languages under this definition.
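The definitions above can be tried out in a few lines of Python. This is only an illustrative sketch: the helper names strings_of_length and is_string_over are made up for this example, and symbols are assumed to be single characters.

```python
from itertools import product

def strings_of_length(alphabet, n):
    """Return every string of length n over the alphabet, in sorted order."""
    return ["".join(symbols) for symbols in product(sorted(alphabet), repeat=n)]

def is_string_over(s, alphabet):
    """True if every symbol of s is drawn from the alphabet."""
    return all(symbol in alphabet for symbol in s)

BINARY = {"0", "1"}   # the binary alphabet
EMPTY = ""            # the empty string, written ε in the text

print(strings_of_length(BINARY, 2))    # ['00', '01', '10', '11']
print(strings_of_length(BINARY, 0))    # [''] : the only string of length zero
print(is_string_over("0110", BINARY))  # True
print(is_string_over("012", BINARY))   # False, because 2 is not in the alphabet
```

Any Python set of such strings, including set() and {""}, then plays the role of a (finite) language over the alphabet.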

Regular expression
Regular expressions are an important notation for specifying lexeme patterns. They are effective in specifying the types of patterns that we need for tokens. For example, identifiers can be described by the regular expression
identifier = letter (letter | digit)*
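As a rough illustration, the identifier pattern can be written with Python's re module. The sketch assumes that "letter" means an ASCII letter and "digit" means 0 to 9; the names LETTER, DIGIT and is_identifier are chosen for this example only.

```python
import re

LETTER = "[A-Za-z]"   # assumption: letter = ASCII letters
DIGIT = "[0-9]"       # assumption: digit = 0..9

# identifier = letter (letter | digit)*
IDENTIFIER = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")

def is_identifier(text):
    """True if the whole text is one identifier lexeme."""
    return IDENTIFIER.fullmatch(text) is not None

print(is_identifier("count"))  # True
print(is_identifier("x1"))     # True
print(is_identifier("1x"))     # False: must start with a letter
print(is_identifier(""))       # False: at least one letter is required
```

fullmatch is used rather than search so that the pattern has to describe the entire lexeme, not just part of it.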

Regular expression construction rules
Regular expressions are built recursively out of smaller regular expressions, using the rules described below.
ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
If a is a symbol in ∑ (the alphabet), then a is a regular expression denoting {a}, the language with only one string.
If r and s are regular expressions denoting the languages L(r) and L(s) respectively, then:
(r)|(s) is a regular expression denoting L(r) ∪ L(s)
(r).(s) is a regular expression denoting L(r).L(s)
(r)* is a regular expression denoting (L(r))*
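The three language operations behind these rules can be sketched on Python sets of strings. The function names are hypothetical, and because L* is infinite whenever L contains a non-empty string, the closure below is cut off at an assumed max_len parameter.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """L.M: every string of L followed by every string of M."""
    return {x + y for x in L for y in M}

def kleene_star(L, max_len):
    """Strings of L* whose length is at most max_len (a finite slice of L*)."""
    result = {""}      # zero repetitions always gives the empty string
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more string of L, keeping only new, short ones.
        frontier = {x + y for x in frontier for y in L if len(x + y) <= max_len} - result
        result |= frontier
    return result

La, Lb = {"a"}, {"b"}                      # L(a) and L(b)
print(sorted(union(La, Lb)))               # ['a', 'b']
print(sorted(concat(union(La, Lb), Lb)))   # ['ab', 'bb']
print(sorted(kleene_star(La, 3)))          # ['', 'a', 'aa', 'aaa']
```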

Precedence of operations
The unary operator * has the highest precedence and is left associative.
Concatenation has the second highest precedence and is left associative.
| has the lowest precedence and is left associative.
Under these conventions, a|b*c is equivalent to (a)|((b)*(c)).
For any regular expressions R, S and T the following axioms hold:
R|S = S|R (| is commutative)
R|(S|T) = (R|S)|T (| is associative)
R(ST) = (RS)T (concatenation is associative)
R(S|T) = RS|RT (concatenation distributes over |)
εR = Rε = R (ε is the identity for concatenation)
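One way to experiment with the precedence rules and axioms is to compare two patterns on every short string over a small alphabet. The sketch below uses Python's re module, whose operators follow the same precedence order; agree_up_to is a hypothetical helper, and a bounded check like this is only evidence, not a proof of equivalence.

```python
import re
from itertools import product

def agree_up_to(r1, r2, alphabet, max_len):
    """True if r1 and r2 accept exactly the same strings of length <= max_len."""
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if (p1.fullmatch(s) is None) != (p2.fullmatch(s) is None):
                return False
    return True

# Precedence: * binds tighter than concatenation, which binds tighter than |.
print(agree_up_to("a|b*c", "(a)|((b)*(c))", "abc", 4))  # True
print(agree_up_to("a|b*c", "(a|b)*c", "abc", 4))        # False: "ac" separates them

# Axioms, checked on strings of length <= 4 with R = a, S = b, T = c.
print(agree_up_to("a|b", "b|a", "abc", 4))              # True (commutativity of |)
print(agree_up_to("a(b|c)", "ab|ac", "abc", 4))         # True (distributivity)
```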

The regular expression a|b denotes the language {a, b}.
(a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of length two over the alphabet {a, b}.
a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
(a|b)* denotes the set of all strings consisting of zero or more instances of a or b.
a|a*b denotes the language {a, b, ab, aab, aaab, ...}.
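These example languages can be listed mechanically, at least up to a length bound. The following sketch filters all short strings over {a, b} through Python's re module; language_up_to is a name invented for this illustration, and the empty string ε shows up as ''.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """List the strings of length <= max_len that the pattern denotes, shortest first."""
    compiled = re.compile(pattern)
    members = []
    for n in range(max_len + 1):
        for symbols in product(sorted(alphabet), repeat=n):
            s = "".join(symbols)
            if compiled.fullmatch(s):
                members.append(s)
    return members

print(language_up_to("a|b", "ab", 3))         # ['a', 'b']
print(language_up_to("(a|b)(a|b)", "ab", 3))  # ['aa', 'ab', 'ba', 'bb']
print(language_up_to("a*", "ab", 3))          # ['', 'a', 'aa', 'aaa']
print(language_up_to("(a|b)*", "ab", 2))      # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(language_up_to("a|a*b", "ab", 4))       # ['a', 'b', 'ab', 'aab', 'aaab']
```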

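Regular definition
We may wish to give names to certain regular expressions and use those names in subsequent expressions as if the names were themselves symbols. A regular definition is a sequence of definitions of the form
di -> ri
where each di is a new name and each ri is a regular expression that may use the names defined before it.
For example, for the language of C identifiers:
letter_ -> A|B|…|Z|a|b|…|z|_
digit -> 0|1|…|9
id -> letter_(letter_|digit)*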
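A regular definition can be mimicked by expanding each name into the expressions defined before it. The sketch below is one possible way to do this in Python, not a standard API: the expand helper and the {name} placeholder syntax are assumptions of this example, and the letter and digit ranges are written as character classes.

```python
import re

# Each entry is (name, body); a body may refer to earlier names as {name}.
DEFINITIONS = [
    ("letter_", "[A-Za-z_]"),
    ("digit", "[0-9]"),
    ("id", "{letter_}(?:{letter_}|{digit})*"),
]

def expand(definitions):
    """Substitute earlier definitions into later ones, in order."""
    expanded = {}
    for name, body in definitions:
        # str.format fills the {name} placeholders; bodies that need literal
        # braces (such as a{2}) would have to escape them as {{ and }}.
        expanded[name] = "(?:" + body.format(**expanded) + ")"
    return expanded

patterns = expand(DEFINITIONS)
ident = re.compile(patterns["id"])

print(bool(ident.fullmatch("_tmp1")))   # True
print(bool(ident.fullmatch("9lives")))  # False: cannot start with a digit
```

Wrapping every expansion in a non-capturing group keeps the precedence of the original definition intact after substitution.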

Extensions of regular expressions
+ : one or more instances
* : zero or more instances
? : zero or one instance
[ ] : character classes, e.g. [a-z]
For example, white space can be defined as
ws -> (blank|tab|newline)+
When ws is recognized, we do not return a token to the parser; instead, lexical analysis restarts at the character following the white space.
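The treatment of ws can be sketched as a small scanner loop in Python. Everything here is an assumption made for illustration: the token names, the number pattern, and the tokenize function are not from the slides, and a real lexical analyzer would normally be generated from the definitions rather than written by hand.

```python
import re

# Hypothetical token patterns; ws uses + (one or more) and a character class.
TOKEN_SPEC = [
    ("ws", r"[ \t\n]+"),                  # ws -> (blank|tab|newline)+
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("number", r"[0-9]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return (token name, lexeme) pairs, silently skipping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        # For ws nothing is returned; scanning restarts right after the white space.
        pos = match.end()
    return tokens

print(tokenize("x1 \t 42\ncount"))
# [('id', 'x1'), ('number', '42'), ('id', 'count')]
```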
