Download presentation

1
**CPSC 388 – Compiler Design and Construction**

Scanners – Regular Expressions

2
**Announcements Last day to Add/Drop Sept 11**

Wipe Down Computer Keyboards and Mice ACM programming contest Read Chapter 3 Homework 2 due this Friday PROG1 due this Friday FSA for Java string constants (Anybody figure it out?)

3
Homework 1 returned Summary – In addition to your summary please answer the following questions in your report: Context – What is the context of the research presented in the paper? What new ideas or concepts do you think were presented in the paper? Evaluation – How was the research evaluated? What evaluation techniques would you like to see to compare the research to the state of the art before the paper? Significance – What is the significance of the research? Do you feel the work is a minor improvement or a major step in the field? Grammar / spelling / clarity / support for statements made

4
**Scanner Generator Scanner Generator .Jlex file Containing**

Regular Expressions .java file Containing Scanner code To understand Regular Expressions you need to understand Finite-State Automata

5
**FSA Formal Definition (5-tuple)**

Q – a finite set of states Σ – The alphabet of the automata (finite set of characters to label edges) δ – state transition function δ(statei,character) statej q – The start state F – The set of final states

6
**Types of FSA Deterministic (DFA) Non-Deterministic (NFA)**

No State has more than one outgoing edge with the same label Non-Deterministic (NFA) States may have more than one outgoing edge with same label. Edges may be labeled with ε, the empty string. The FSA can take an epsilon transition without looking at the current input character.

7
Terms to Know Alphabet (Σ) – any finite set of symbols e.g. binary, ASCII, Unicode String – finite sequence of symbols e.g , banana, bãër Language – any countable set of strings e.g. Empty set Well-formed C programs English words

8
Regular Expressions Easy way to express a language that is accepted by FSA Rules: ε is a regular expression Any symbol in Σ is a regular expression If r and s are any regular expressions then so is: r|s denotes union e.g. “r or s” rs denotes r followed by s (concatination) (r)* denotes concatination of r with itself zero or more times (Kleene closer) () used for controlling order of operations

9
**Example Regular Expressions**

Corresponding Language ε {“”} a {“a”} abc {“abc”} a|b|c {“a”,”b”,”c”} (a|b|c)* {“”,”a”,”b”,”c”,”aa”,”ab”,”ac”,”aaa”,…} a|b|c|…|z|A|B|…|Z Any letter 0|1|2|…|9 Any digit

10
**Precedence in Regular Expressions**

* has highest precedence, left associative Concatenation has second highest precedence, left associative | has lowest associative, left associative

11
**More Regular Expression Examples**

Corresponding Language ε|a|b|ab* {“”, “a”, “b”, “ab”, “abb”, “abbb”,…} ab*c {“ac”, “abc”, “abbc”,…} ab*|a* {“”, “a”, “ab”, “aa”, “aaa”, “abb”,…} a(b*|a*) {“a”, “ab”, “aa”, “abb”, “aaa”, …} a(b|a)* {“a”, “ab”, “aa”, “aaa”, “aab”, “aba”,…}

12
**You Try What is the language described by each Regular Expression? a***

(a|b)* a|a*b (a|b)(a|b) aa|ab|ba|bb (+|-|ε)(0|1|2|3|4|5|6|7|8|9)*

13
Regular Definitions If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form: D1 → R1 D2 → R2 … Dn → Rn Each di is a new symbol not in Σ and not the same as any other of the d’s. Each ri is a regular expression over Σ U (d1,d2,…,di-1)

14
**Regular Definitions Example**

Example C identifiers: Σ = ASCII letter_ → a|b|c|…|z|A|B|C|…|Z|_ digit → 0|1|2|…|9 id → letter_(letter_|digit)*

15
**Regular Definitions Example**

Example Unsigned Numbers (integer or float): Σ = ASCII digit → 0|1|2|…|9 digits → digit digit* optionalFraction → . digits | ε optionalExponent → (E(+|-| ε)digits)| ε number → digits optionalFraction optionalExponent

16
**Special Characters in Reg. Exp.**

What does each of the following mean? * | () [] + ? “” . \ ^ $ – Kleene Closure – or – grouping – creates a character class – Positive Closure – zero or one instance – anything in quotes means itself, e.g. “*” – matches any single character (except newline) – used for escape characters (newline, tab, etc.) – matches beginning of a line – matches the end of a line

17
**Extensions to Regular Expressions**

+ means one or more occurrence (positive closure) ? means zero or one occurrence Character classes a|r|t can be written [art] a|b|…|z can be written [a-z] As long as there is a clear ordering to characters [^a-z] matches any character except a-z

18
**Example Using Character Classes**

^[^aeiou]*$ Matches any complete line that does not contain a lowercase vowel How do you tell which meaning of ^ is intended?

19
**Try It Create Character Classes for: Create Regular Expressions for:**

First ten letters (up to “j”) Lowercase consonants Digits in hexadecimal Create Regular Expressions for: Case Insensitive keyword such as SELECT (or Select or SeLeCt) in SQL Java string constants Any string of whitespace characters

20
Creating a Scanner Create a set of regular expressions, one for each token to be recognized Convert regular expressions into one combined DFA Run DFA over input character stream Longest matching regular expression is selected If a tie then use first matching regular expression Attach code to run when a regular expression matches

Similar presentations

© 2019 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google