1
Lexical Analysis Dragon Book: chapter 3

2
**Compiler structure**

Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program.

Error handling and the symbol table are shared by all phases.

3
**Compiler structure**

Source program → Lexical analyzer → (token) → Syntax analyzer. The syntax analyzer asks the lexical analyzer for each token ("get next token"). Both use error handling and the symbol table.

4
**Tokens in programming languages**

| Token | Sample instances | Description |
|---|---|---|
| if | if | keyword |
| rel | <, <=, <>, >=, > | relation |
| id | count, length, point2 | variable |
| num | 7, 145e-3 | numerical constant |
| str | “abc”, “some space”, “\7\” is a char” | constant string |

5
**Tokens may be difficult to recognize**

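- Fortran: in DO 5 I=1,25 spaces do not count, so the prefix DO 5 I= could equally begin an assignment to a variable named DO5I; only the comma identifies a loop.
- PL/I: IF THEN THEN THEN=ELSE; ELSE ELSE=THEN; (no reserved keywords).
- PL/I: PR1(2, 7, 18, D*3, )=3 (procedure call or array reference).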

6
**Strings, languages**

- String: a sequence of characters over some alphabet, e.g., over {0, 1}. In computers, usually ASCII or EBCDIC.
- Length of a string: its number of characters.
- Empty string: ε (length 0).
- Concatenation: putting one string after another. X=dog, Y=house, XY=doghouse (also written X.Y).
- Prefix: ban is a prefix of banana.
- Suffix: ana is a suffix of banana.

7
**Language: a set of strings**

- The alphabet is a language: L = {A, B, …, Z, a, b, …, z}.
- Constant languages: X = {ab, ba}, Y = {a}.
- Concatenation: X.Y = {aba, baa}, Y.X = {aab, aba}.
- Union: $X \cup Y = X+Y = X|Y = \{ab, ba, a\}$.
- Exponentiation: $X^3 = X.X.X$.
- Star: $X^*$ = zero or more occurrences of strings from X.
- $L^*$ = all words with letters from L. $L^+$ = all words with one or more letters from L.
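The operations on this slide can be tried out on finite sets of strings. The following is a minimal Python sketch, not part of the original slides: the helper names `concat`, `power` and `star_up_to` are illustrative, and the star is cut off at a maximum exponent because the full star is an infinite set.

```python
def concat(x, y):
    """Concatenation X.Y: each string of x followed by each string of y."""
    return {s + t for s in x for t in y}

def power(x, n):
    """Exponentiation: x concatenated with itself n times.
    The zeroth power is the language holding only the empty string."""
    result = {""}
    for _ in range(n):
        result = concat(result, x)
    return result

def star_up_to(x, max_n):
    """Finite approximation of the star: union of powers 0..max_n."""
    words = set()
    for i in range(max_n + 1):
        words |= power(x, i)
    return words

X = {"ab", "ba"}
Y = {"a"}
print(sorted(concat(X, Y)))      # ['aba', 'baa']
print(sorted(concat(Y, X)))      # ['aab', 'aba']
print(sorted(X | Y))             # ['a', 'ab', 'ba']
print(sorted(star_up_to(Y, 3)))  # ['', 'a', 'aa', 'aaa']
```

Python's built-in set union (`|`) plays the role of the union on the slide; only concatenation and exponentiation need helpers.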

8
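**Regular expressions**

$X \mid Y = X \cup Y = \{\, s \mid s \in X \text{ or } s \in Y \,\}$.

$X.Y = \{\, x.y \mid x \in X \text{ and } y \in Y \,\}$.

$X^* = \bigcup_{i=0}^{\infty} X^i$.

$X^+ = \bigcup_{i=1}^{\infty} X^i$.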

9
**Examples**

$a \mid b = \{a, b\}$.

$(a \mid b).(a \mid b) = \{aa, ab, ba, bb\}$.

$a^* = \{\varepsilon, a, aa, aaa, \dots\}$.

$(a \mid b)^* = \{\varepsilon, a, b, ab, ba, aa, aba, \dots\}$.

10
**Defining tokens**

- digit → [0-9]
- digits → digit+
- fraction → . digits | ε
- exponent → E ( + | - | ε ) digits | ε
- const → digits fraction exponent
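The definitions above (digit, digits, fraction, exponent, const) map almost one to one onto a regular expression library. A minimal Python sketch follows, assuming the empty alternatives in fraction and exponent stand for ε (written here as an optional group); the names `CONST` and `is_const` are illustrative.

```python
import re

DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}+"                  # digits   -> digit+
FRACTION = rf"(?:\.{DIGITS})?"         # fraction -> . digits | empty
EXPONENT = rf"(?:E[+-]?{DIGITS})?"     # exponent -> E (+ | - | empty) digits | empty
CONST = re.compile(rf"{DIGITS}{FRACTION}{EXPONENT}")  # const -> digits fraction exponent

def is_const(text):
    """True when the whole of text is a numeric constant."""
    return CONST.fullmatch(text) is not None

print(is_const("7"))        # True
print(is_const("145E-3"))   # True
print(is_const("3.14E+2"))  # True
print(is_const("3."))       # False: a fraction needs digits after the dot
```

As written, the exponent marker is the upper-case `E` used in the definition, so a lower-case `e` is rejected unless the pattern is extended.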

11
**Not everything is regular!**

- All the words of the form w c w, where w is a word and c a letter.
- The syntax of a program, e.g., the recursive definition of if-then-else: stmt → if expr then stmt else stmt.

12
**Reading the input**

If a>8 then goto nextloop else begin while z>8 do

Two positions are tracked in the input: where the current token starts, and the last character read. The lexer sometimes needs to "look ahead". For example, to identify the variable done it must read past the letters do, which on their own form a keyword. It may also need to "unread" a character.
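The idea of looking ahead and then unreading one character can be sketched in a few lines of Python. This is only an illustration: the `Reader` class, the `next_word` function and the keyword list are hypothetical helpers, not something defined in the slides.

```python
class Reader:
    """Input buffer with a one-character unread (hypothetical helper)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0                      # index of the next character to read

    def read(self):
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def unread(self):
        self.pos -= 1

# Assumed keyword set, taken from the sample input line on the slide.
KEYWORDS = {"if", "then", "else", "begin", "while", "do", "goto"}

def next_word(reader):
    """Scan a keyword or identifier, assuming the reader sits on a letter."""
    start = reader.pos                    # token starts here
    ch = reader.read()
    while ch.isalnum():                   # keep reading: "do" might become "done"
        ch = reader.read()
    reader.unread()                       # last character read is not part of the token
    lexeme = reader.text[start:reader.pos]
    kind = "keyword" if lexeme.lower() in KEYWORDS else "id"
    return kind, lexeme

print(next_word(Reader("do x")))     # ('keyword', 'do')
print(next_word(Reader("done;")))    # ('id', 'done')
```

The decision between `do` and `done` is only made after a character that cannot belong to the word has been read, which is why that character has to be pushed back.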

13
**Returning: token + attributes**

Input: if xyz > 11 then

- if, keyword
- id, value=xyz
- op, value=“>”
- const, value=11
- then, keyword
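A small tokenizer that returns each token together with its attribute, as in the example above, might look like the following Python sketch. The token classes, the keyword set and the function name `tokenize` are assumptions chosen to reproduce the slide's example, not a definitive design.

```python
import re

# Hypothetical token specification covering only what the example needs.
TOKEN_SPEC = [
    ("const", r"[0-9]+"),
    ("word",  r"[A-Za-z][A-Za-z0-9]*"),
    ("op",    r"<=|>=|<>|<|>"),
    ("skip",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
KEYWORDS = {"if", "then", "else"}

def tokenize(text):
    """Return a list of (token, attribute) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue                      # white space produces no token
        if kind == "word":
            # Keywords are their own token; other words are identifiers.
            tokens.append((lexeme, "keyword") if lexeme in KEYWORDS else ("id", lexeme))
        elif kind == "const":
            tokens.append(("const", int(lexeme)))
        else:
            tokens.append(("op", lexeme))
    return tokens

print(tokenize("if xyz > 11 then"))
# [('if', 'keyword'), ('id', 'xyz'), ('op', '>'), ('const', 11), ('then', 'keyword')]
```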

14
**Finite Automata**

Includes:

- States {s1, s2, …, s5}.
- Initial states {s1}.
- Accepting states {s3, s5}.
- Alphabet {a, b, c}.
- Transitions: {(s1,a,s2), (s2,a,s3), …}.

[Diagram: the automaton over states s1 to s5, with edges labeled a, b and c.]

Deterministic?

15
**Automaton. What is the language?**

[Diagram: two-state automaton over states s0 and s1, with edges labeled a and b.]

Formally: An input is a word over the alphabet Σ. A run over a word is an alternating sequence of states and letters, starting from the initial state. Accepting run: ends with an accepting state.

16
**Example**

[Diagram: two-state automaton over states s0 and s1, with edges labeled a and b.]

Input: aabbb. Run: s0 a s0 a s0 b s1 b s1 b s1. Accepts.

Input: aba. Run: s0 a s0 b s1 a s0. Does not accept.
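The two runs above can be reproduced with a table-driven simulation. In this Python sketch the transition table and the accepting state are assumptions inferred from the runs shown on the slide (the run ending in s1 accepts, the one ending in s0 does not), since the diagram itself is not available.

```python
# Transitions inferred from the slide's runs (an assumption).
DELTA = {
    ("s0", "a"): "s0",
    ("s0", "b"): "s1",
    ("s1", "a"): "s0",
    ("s1", "b"): "s1",
}
INITIAL = "s0"
ACCEPTING = {"s1"}   # assumed accepting state

def run(word):
    """Return the run: an alternating list of states and letters."""
    state = INITIAL
    trace = [state]
    for letter in word:
        state = DELTA[(state, letter)]
        trace += [letter, state]
    return trace

def accepts(word):
    """A word is accepted when its run ends in an accepting state."""
    return run(word)[-1] in ACCEPTING

print(" ".join(run("aabbb")), accepts("aabbb"))  # s0 a s0 a s0 b s1 b s1 b s1 True
print(" ".join(run("aba")), accepts("aba"))      # s0 a s0 b s1 a s0 False
```

Under these assumptions the automaton accepts exactly the words that end with b.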

17
**Automaton. What is the language?**

[Diagram: two-state automaton over states s0 and s1, with edges labeled a and b.]

18
**Automaton. What is the language?**

[Diagram: two-state automaton over states s0 and s1, with edges labeled a and b.]

19
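**Identifying tokens**

[Diagram: automaton with one path of letters per keyword (I F, T H E N, E L S E) and a path for identifiers: a letter followed by a loop on letter|digit.]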

20
**Non deterministic automata**

[Diagram: nondeterministic automaton ending in state s3, with edges labeled 0,1 and 1.]

- Allows more than a single transition from a state with the same label.
- There does not have to be a transition from every state with every label.
- Allows multiple initial states.
- Allows ε transitions.

21
**Nondeterministic runs**

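[Diagram: nondeterministic automaton ending in state s3, with edges labeled 0,1 and 1.]

Input: 0100

- Run 1: s0 0 s0 1 s0 0 s0 0 s0. Does not accept.
- Run 2: s0 0 s0 1 s1 0 s2 0 s3. Accepts.

The automaton accepts a word when there exists an accepting run.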
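Instead of trying runs one at a time, a program can track the set of all states the automaton could be in after each letter; the word is accepted when that set contains an accepting state. The Python sketch below assumes the transitions suggested by the slide's labels and runs (a 0,1 loop on s0, a 1 edge to s1, then 0,1 edges to s2 and s3); treat the table as a reconstruction, not as the original diagram.

```python
# Reconstructed transition relation (an assumption based on the slide's runs).
DELTA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}
INITIAL = {"s0"}
ACCEPTING = {"s3"}

def accepts(word):
    """True when at least one run over word ends in an accepting state."""
    current = set(INITIAL)
    for letter in word:
        # Follow every possible transition from every current state.
        current = {t for s in current for t in DELTA.get((s, letter), set())}
    return bool(current & ACCEPTING)

print(accepts("0100"))  # True: run 2 on the slide ends in s3
print(accepts("0000"))  # False: no run reaches s3
```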

22
**Determinizing Automata**

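[Diagram: the nondeterministic automaton N over states s0 to s3, with edges labeled 0,1 and 1.]

- Each state of D is a set of the states of N.
- $S \xrightarrow{a} T$ when $T = \{\, t \mid s \in S \text{ and } s \xrightarrow{a} t \,\}$.
- The initial state of D includes all the initial states of N.
- Accepting states in D include at least one accepting state of N.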
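The construction described above can be sketched as a breadth-first exploration of the reachable sets of states. This Python version is a minimal illustration for automata without ε transitions; the function name `determinize` and the example transition table (reconstructed from the slide's labels) are assumptions.

```python
from collections import deque

def determinize(delta, initial, accepting, alphabet):
    """Subset construction: each state of D is a frozenset of states of N."""
    start = frozenset(initial)            # holds all initial states of N
    d_delta = {}
    seen = {start}
    queue = deque([start])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # T = { t | s in S and s --a--> t }
            T = frozenset(t for s in S for t in delta.get((s, a), ()))
            d_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    # A set is accepting when it holds at least one accepting state of N.
    d_accepting = {S for S in seen if S & accepting}
    return seen, d_delta, start, d_accepting

# Reconstructed example automaton (an assumption).
N_DELTA = {
    ("s0", "0"): {"s0"}, ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"}, ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"}, ("s2", "1"): {"s3"},
}
states, d_delta, start, d_accepting = determinize(N_DELTA, {"s0"}, {"s3"}, "01")
print(len(states), len(d_accepting))   # 8 4
```

For this four-state automaton the deterministic version has eight reachable states, four of them accepting.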

23
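**Determinization**

[Diagram: the nondeterministic automaton over s0, s1, s2, s3 (edges labeled 1 and 0,1) next to the first states of the deterministic automaton: s0; s0,s1; s0,s2; s0,s3.]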

24
**Determinization**

[Diagram: the resulting deterministic automaton, with states labeled 000, 100, 010, 101, 110, 111, 011, 001.]

25
**Translating regular expressions into automata**

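[Diagrams: automaton constructions for the union $L_1 \cup L_2$ (from automata for $L_1$ and $L_2$), the concatenation $L_1.L_2$, and the star $L^*$.]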

26
**Automatic translation**

(a|b).(a|b) = (a∪b)(a∪b) = (a+b).(a+b) = …

[Diagram: the automaton obtained for this expression, with edges labeled a and b.]

27
**Determinization with ε transitions**

Add to each set the states reachable using ε transitions.

[Diagram: automaton with ε transitions and edges labeled a and b; the state sets shown are s0,s1,s2; s3,s5,s6,s7,s8; s9,s11; s4,s5,s6,s7,s8; s10,s11.]
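Adding the states reachable through ε transitions is usually done with a closure function applied to the initial set and after every step. The Python sketch below uses a small hypothetical automaton for a|b (not the one drawn on the slide) and represents ε by the empty string; both choices are assumptions of this example.

```python
EPS = ""   # label used in this sketch for epsilon transitions

def epsilon_closure(states, delta):
    """All states reachable from `states` using only epsilon transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, letter, delta):
    """One deterministic step: follow `letter`, then close under epsilon."""
    reached = {t for s in states for t in delta.get((s, letter), ())}
    return epsilon_closure(reached, delta)

# Hypothetical automaton for a|b with epsilon transitions.
DELTA = {
    ("q0", EPS): {"q1", "q3"},
    ("q1", "a"): {"q2"},
    ("q3", "b"): {"q4"},
    ("q2", EPS): {"q5"},
    ("q4", EPS): {"q5"},
}
start = epsilon_closure({"q0"}, DELTA)
print(sorted(start))                    # ['q0', 'q1', 'q3']
print(sorted(move(start, "a", DELTA)))  # ['q2', 'q5']
```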

28
**Minimization**

[Diagram: automaton over states p0 to p4, with edges labeled a and b.]

- Group all the states together.
- Separate states according to available exit transitions.
- Separate a set into two if from some of its states one can reach another set and from the others one cannot.
- Repeat until no set can be separated.
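The repeated separation of sets is a partition refinement. The Python sketch below uses the common textbook variant that starts from the split between accepting and non-accepting states and assumes a deterministic automaton with a transition for every state and letter; the example automaton and the name `minimize` are hypothetical, since the slide's diagram is not available.

```python
def minimize(states, alphabet, delta, accepting):
    """Return the groups of equivalent states of a complete DFA."""
    # Initial partition: accepting versus non-accepting states.
    block = {s: int(s in accepting) for s in states}
    count = len(set(block.values()))
    while True:
        ids = {}
        refined = {}
        for s in states:
            # States stay together only if they are in the same group and
            # every letter leads them into the same group.
            sig = (block[s],) + tuple(block[delta[(s, a)]] for a in alphabet)
            refined[s] = ids.setdefault(sig, len(ids))
        block = refined
        if len(ids) == count:      # nothing was separated: done
            break
        count = len(ids)
    groups = {}
    for s, b in block.items():
        groups.setdefault(b, set()).add(s)
    return {frozenset(g) for g in groups.values()}

# Hypothetical four-state DFA for "words ending in b", with redundant states.
STATES = ["q0", "q1", "q2", "q3"]
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q2",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q3",
    ("q3", "a"): "q1", ("q3", "b"): "q3",
}
result = minimize(STATES, "ab", DELTA, {"q2", "q3"})
print(sorted(sorted(g) for g in result))   # [['q0', 'q1'], ['q2', 'q3']]
```

Each resulting group becomes one state of the minimized automaton.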

29
**Minimization**

[Diagram: automaton over states p0 to p4, with edges labeled a and b.]

Group all the states together.

30
**Minimization**

[Diagram: automaton over states p0 to p4, with edges labeled a and b.]

Separate states according to available exit transitions.

31
**Minimization**

[Diagram: automaton over states p0 to p4, with edges labeled a and b.]

Separate a set into two if from some of its states one can reach another set and from the others one cannot. Repeat until no set can be separated.

32
**Can minimize now**

[Diagram: automaton with edges labeled a and b.]

33
**Lex**

A Lex program has three parts, separated by %% lines:

- Declarations
- Translation rules
- Auxiliary procedures

34
**Lex behavior**

- Lex source program (lex.l) → Lex → lex.yy.c
- lex.yy.c → C compiler → a.out
- Input stream → a.out → Output tokens

35
**Lex behavior**

- Translates the definitions into an automaton.
- The automaton looks for the longest matching string.
- It either returns some value to the reading program (the parser) or looks for the next token.
- Lookahead operator: x/y allows the token x only if y follows it (but y is not part of the token).
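The longest-match behavior can be imitated in Python by trying every rule at the current position and keeping the longest result, with the earlier rule winning ties. This is a sketch of the idea only, not of how Lex is implemented; the rule set and the names `longest_match` and `scan` are hypothetical. Python's `(?=…)` lookahead is used as an analogue of the x/y operator.

```python
import re

# Hypothetical rules, in priority order.
RULES = [
    ("IF",  re.compile(r"if")),
    ("ID",  re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS",  re.compile(r"[ \t\n]+")),
]

def longest_match(text, pos):
    """Return (rule name, end position) of the longest match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so on a tie
        # the rule listed first is kept.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best

def scan(text):
    pos, out = 0, []
    while pos < len(text):
        hit = longest_match(text, pos)
        if hit is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end = hit
        if name != "WS":                  # white space: look for the next token
            out.append((name, text[pos:end]))
        pos = end
    return out

print(scan("if iffy 42"))   # [('IF', 'if'), ('ID', 'iffy'), ('NUM', '42')]

# Analogue of the lookahead operator ab/c: match ab only when c follows.
LOOK = re.compile(r"ab(?=c)")
print(LOOK.match("abc").group())   # ab
print(LOOK.match("abd"))           # None
```

The word `iffy` comes out as a single identifier because the identifier rule matches more characters than the keyword rule at that position.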

36
**Lex Project**

- Project collection date: Feb 11th.
- Work in pairs (singles).
- Use lex to take a text and check whether the number of open parentheses of any kind is equal to the number of closed parentheses.
- Exception: inside quotes. \” is not a closing quote.
