Lexical Analysis Dragon Book: chapter 3.


1 Lexical Analysis Dragon Book: chapter 3

2 Compiler structure
Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program. Error handling and the symbol table serve all phases.

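3 Compiler structure: lexical analyzer and syntax analyzer
Source program → Lexical analyzer → Syntax analyzer. The syntax analyzer asks the lexical analyzer to “get next token”, and the lexical analyzer returns a token. Both use error handling and the symbol table.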

4 Tokens in programming languages
Token (description): sample instances
if (keyword)
rel (relation): <, <=, <>, >=, >
id (variable): count, length, point2
num (numerical constant): 7, 145e-3
str (constant string): “abc”, “some space”, “\7\” is a char”

5 Tokens may be difficult to recognize
Fortran: DO 5 I=1.25 versus DO 5 I=1,25 (spaces do not count, so the first is an assignment to the variable DO5I and the second starts a loop).
PL/I: IF THEN THEN THEN=ELSE; ELSE ELSE=THEN; (no reserved keywords).
PL/I: PR1(2, 7, 18, D*3, )=3 (procedure call or array reference).

6 Strings, languages
String: a sequence of characters over some alphabet, e.g., over {0, 1}. In computers, usually ASCII or EBCDIC.
Length of a string: its number of characters.
Empty string: ε (length 0).
Concatenation: putting one string after another. X=dog, Y=house, XY=doghouse (also written X.Y).
Prefix: ban is a prefix of banana.
Suffix: ana is a suffix of banana.

7 Language: a set of strings
The alphabet is a language: L={A, B, …, Z, a, b, …, z}.
Constant languages: X={ab, ba}, Y={a}.
Concatenation: X.Y = {aba, baa}, Y.X = {aab, aba}.
Union: X∪Y = X+Y = X|Y = {ab, ba, a}.
Exponentiation: X^3 = X.X.X.
Star: X* = zero or more occurrences of words from X, concatenated.
L* = all words with letters from L. L+ = all words with one or more letters from L.
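The language operations on this slide can be tried out directly on finite sets of strings. The following is a minimal Python sketch; the function names concat, power, and star_up_to are illustrative choices, not part of the slides, and since X* is infinite, star_up_to only builds a finite approximation.

```python
def concat(X, Y):
    """Language concatenation X.Y: every word of X followed by every word of Y."""
    return {x + y for x in X for y in Y}


def power(X, n):
    """X^n: X concatenated with itself n times; X^0 is {""} (only the empty string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, X)
    return result


def star_up_to(X, max_power):
    """Finite approximation of X*: the union of X^0, X^1, ..., X^max_power."""
    words = set()
    for i in range(max_power + 1):
        words |= power(X, i)
    return words


# The constant languages from the slide.
X = {"ab", "ba"}
Y = {"a"}

print(concat(X, Y))          # the words aba and baa
print(concat(Y, X))          # the words aab and aba
print(X | Y)                 # union: ab, ba, a
print(star_up_to({"a"}, 3))  # empty string, a, aa, aaa
```

Python's built-in set union (the | operator) plays the role of X∪Y here, which is why only concatenation, exponentiation, and star need their own functions.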

8 Regular expressions
X|Y = X∪Y = { s | s∈X or s∈Y }.
X.Y = { x.y | x∈X and y∈Y }.
X* = ∪_{i=0,…,∞} X^i.
X+ = ∪_{i=1,…,∞} X^i.

9 Examples
a|b = {a, b}.
(a|b).(a|b) = {aa, ab, ba, bb}.
a* = { ε, a, aa, aaa, … }.
(a|b)* = { ε, a, b, ab, ba, aa, aba, … }.

10 Defining tokens
digit → [0-9]
digits → digit+
fraction → . digits | ε
exponent → E ( + | - | ε ) digits | ε
const → digits fraction exponent
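As a rough illustration, the definitions above can be transcribed almost line by line into Python's re module. This is only a sketch: the constant names and the helper is_const are made up for the example, and each "| ε" alternative is written as an optional group.

```python
import re

DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}+"                   # digits   -> digit+
FRACTION = rf"(?:\.{DIGITS})?"          # fraction -> . digits | ε
EXPONENT = rf"(?:E[+-]?{DIGITS})?"      # exponent -> E ( + | - | ε ) digits | ε
CONST = re.compile(rf"{DIGITS}{FRACTION}{EXPONENT}")  # const -> digits fraction exponent


def is_const(text):
    """True when the whole text is a numerical constant under the definitions above."""
    return CONST.fullmatch(text) is not None


for sample in ["7", "3.14", "145E-3", "2.5E+10", ".5", "1.", "145e-3"]:
    print(sample, is_const(sample))
```

Note that the definition as written only allows an uppercase E, so a lowercase spelling such as 145e-3 is rejected by this transcription; a real scanner would probably accept both.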

11 Not everything is regular!
All the words of the form w c w, where w is a word and c a letter.
The syntax of a program, e.g., the recursive definition of if-then-else: stmt → if expr then stmt else stmt.

12 Reading the input
If a>8 then goto nextloop else begin while z>8 do
The scanner keeps two positions in the input: where the current token starts and the last character read.
Sometimes it needs to “look ahead”. For example: identifying the variable done, which begins with the keyword do. It may then need to “unread” a character.

13 Returning: token + attributes
For the input if xyz > 11 then, the lexical analyzer returns:
if, keyword
id, value=xyz
op, value=“>”
const, value=11
then, keyword
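A hand-written scanner makes the last two slides concrete: it remembers where the token starts, reads ahead until the token cannot be extended, and returns a token together with an attribute. The sketch below is an assumption-laden illustration, not the course's scanner: the keyword list, the token names, and the choice to use None as the attribute of a keyword are all invented for the example.

```python
# Hypothetical keyword set, taken from the words used on the slides.
KEYWORDS = {"if", "then", "else", "begin", "while", "do", "goto"}


def tokenize(text):
    """Return a list of (token, attribute) pairs; keywords carry no attribute (None)."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
        elif ch.isalpha():
            start = pos                      # token starts here
            while pos < len(text) and text[pos].isalnum():
                pos += 1                     # keep reading: "do" might still become "done"
            word = text[start:pos]           # pos is one past the last character of the word
            if word.lower() in KEYWORDS:
                tokens.append((word.lower(), None))
            else:
                tokens.append(("id", word))
        elif ch.isdigit():
            start = pos
            while pos < len(text) and text[pos].isdigit():
                pos += 1
            tokens.append(("const", int(text[start:pos])))
        elif ch in "<>=":
            # Look ahead one character to tell ">" from ">=" (and "<" from "<=" or "<>").
            two = text[pos:pos + 2]
            if two in ("<=", ">=", "<>"):
                tokens.append(("op", two))
                pos += 2
            else:
                tokens.append(("op", ch))    # the extra character stays unread
                pos += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    return tokens


print(tokenize("if xyz > 11 then"))
print(tokenize("done"))
```

Because the scanner only advances pos past characters that belong to the token, "unreading" a character amounts to simply not consuming it.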

14 Finite Automata
Includes:
States {s1,s2,…,s5}.
Initial states {s1}.
Accepting states {s3,s5}.
Alphabet {a, b, c}.
Transitions: {(s1,a,s2), (s2, a, s3), …}.
[Diagram: automaton over the states s1 to s5 with edges labeled a, b, and c.]
Deterministic?

15 Automaton. What is the language?
[Diagram: automaton with states s0 and s1 and edges labeled a and b.]
Formally: An input is a word over the alphabet Σ. A run over a word is an alternating sequence of states and letters, starting from the initial state. An accepting run ends in an accepting state.

16 Example
[Diagram: automaton with states s0 and s1 and edges labeled a and b.]
Input: aabbb. Run: s0 a s0 a s0 b s1 b s1 b s1. Accepts.
Input: aba. Run: s0 a s0 b s1 a s0. Does not accept.
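The two runs above can be reproduced with a small table-driven simulator. The transition table in this sketch is inferred from the runs themselves (the diagram did not survive), so treat it as an assumption: a leads to s0, b leads to s1, and s1 is the accepting state.

```python
# Transition table inferred from the two runs on the slide (an assumption).
DELTA = {
    ("s0", "a"): "s0",
    ("s0", "b"): "s1",
    ("s1", "a"): "s0",
    ("s1", "b"): "s1",
}
START = "s0"
ACCEPTING = {"s1"}


def run(word):
    """Return the run over the word: an alternating list of states and letters."""
    state = START
    trace = [state]
    for letter in word:
        state = DELTA[(state, letter)]
        trace += [letter, state]
    return trace


def accepts(word):
    """A deterministic automaton accepts when its single run ends in an accepting state."""
    return run(word)[-1] in ACCEPTING


print(" ".join(run("aabbb")), accepts("aabbb"))
print(" ".join(run("aba")), accepts("aba"))
```

Under this table the automaton accepts exactly the words that end with b.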

17 Automaton. What is the language?
[Diagram: automaton with states s0 and s1 and edges labeled a and b.]

18 Automaton. What is the language?
[Diagram: automaton with states s0 and s1 and edges labeled a and b.]

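19 Identifying tokens
[Diagram: automaton with paths spelling I F, T H E N, and E L S E, and a path for identifiers: a letter followed by a loop on letter|digit.]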

20 Nondeterministic automata
[Diagram: automaton ending in state s3, with edges labeled 0,1 and 1.]
Allows more than a single transition from a state with the same label.
There does not have to be a transition from every state with every label.
Allows multiple initial states.
Allows ε transitions.

21 Nondeterministic runs
[Diagram: automaton ending in state s3, with edges labeled 0,1 and 1.]
Input: 0100.
Run 1: s0 0 s0 1 s0 0 s0 0 s0. Does not accept.
Run 2: s0 0 s0 1 s1 0 s2 0 s3. Accepts.
The automaton accepts a word when there exists an accepting run.
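"There exists an accepting run" can be checked without enumerating runs, by tracking the set of all states the automaton could be in after each letter. The transition relation below is reconstructed from the runs and the edge labels in the transcript (a loop on 0,1 at s0, then 1, then 0,1, then 0,1 into s3), so it is an assumption rather than a copy of the original diagram.

```python
# Nondeterministic transition relation, reconstructed from the slide's runs (an assumption).
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},   # two choices on the same label
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
    # no transitions out of s3: not every state needs an edge for every label
}
INITIAL = {"s0"}
ACCEPTING = {"s3"}


def reachable_states(word):
    """All states in which some run over the word can end."""
    current = set(INITIAL)
    for letter in word:
        current = {t for s in current for t in NFA.get((s, letter), ())}
    return current


def accepts(word):
    """Accept when at least one run ends in an accepting state."""
    return bool(reachable_states(word) & ACCEPTING)


print(reachable_states("0100"), accepts("0100"))
print(reachable_states("0000"), accepts("0000"))
```

For the input 0100 the reachable set contains both s0 (the end of Run 1) and s3 (the end of Run 2), so the word is accepted.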

22 Determinizing Automata
[Diagram: nondeterministic automaton N with states s0, s1, s2, s3 and edges labeled 0,1 and 1.]
Each state of the deterministic automaton D is a set of states of N.
S -a-> T when T = { t | s∈S and s -a-> t }.
The initial state of D includes all the initial states of N.
Accepting states in D include at least one accepting state of N.

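23 Determinization
[Diagram: the nondeterministic automaton over s0, s1, s2, s3 with edge labels 1, 0,1, 0,1, 0,1, next to the first deterministic states s0; s0,s1; s0,s2; s0,s3.]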

24 Determinization
[Diagram: the resulting deterministic automaton, with eight states labeled 000, 001, 010, 011, 100, 101, 110, 111.]
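The determinization rules can be sketched as a worklist algorithm that only builds the sets of states actually reachable from the initial set. The nondeterministic automaton below is the one reconstructed from the earlier runs (an assumption, since the diagram is lost); the function name determinize and the module-level names are illustrative.

```python
from collections import deque

# Nondeterministic automaton N, reconstructed from the slides (an assumption).
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}
NFA_INITIAL = {"s0"}
NFA_ACCEPTING = {"s3"}
ALPHABET = "01"


def determinize(nfa, initial, accepting, alphabet):
    """Subset construction: each state of D is a frozenset of states of N."""
    start = frozenset(initial)           # the initial state of D holds all initial states of N
    delta = {}
    seen = {start}
    queue = deque([start])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # T = { t | s in S and s -a-> t }
            T = frozenset(t for s in S for t in nfa.get((s, a), ()))
            delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    # A state of D accepts when it contains at least one accepting state of N.
    dfa_accepting = {S for S in seen if S & accepting}
    return start, delta, dfa_accepting, seen


START, DELTA, ACCEPTING, STATES = determinize(NFA, NFA_INITIAL, NFA_ACCEPTING, ALPHABET)


def dfa_accepts(word):
    state = START
    for letter in word:
        state = DELTA[(state, letter)]
    return state in ACCEPTING


print(len(STATES))           # number of reachable states of D
print(dfa_accepts("0100"))
```

For this automaton the construction reaches eight sets, which matches the eight states on the slide: each one records which of the last three input letters were 1.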

25 Translating regular expressions into automata
[Diagrams: automata for L1∪L2, L1.L2, and L*, built from the automata for L1, L2, and L.]

26 Automatic translation
(a|b).(a|b) = (a∪b)(a∪b) = (a+b).(a+b) = …
[Diagram: the automaton produced for this expression, with edges labeled a and b.]

27 Determinization with ε transitions
Add to each set the states reachable using ε transitions.
[Diagram: nondeterministic automaton with states up to s11 and edges labeled a and b; the state sets shown are s0,s1,s2; s3,s5,s6,s7,s8; s9,s11; s4,s5,s6,s7,s8; s10,s11.]
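"Add to each set the states reachable using ε transitions" is usually called taking the ε-closure. The sketch below uses a small hypothetical automaton for a|b (states q0 to q5, invented for the example, since the slide's transitions cannot be recovered) and represents an ε transition by the empty string as its label.

```python
EPS = ""  # label used in this sketch for ε transitions

# Hypothetical automaton for a|b with ε transitions (not the one from the slide).
NFA = {
    ("q0", EPS): {"q1", "q3"},
    ("q1", "a"): {"q2"},
    ("q3", "b"): {"q4"},
    ("q2", EPS): {"q5"},
    ("q4", EPS): {"q5"},
}


def epsilon_closure(states, nfa):
    """The given states plus everything reachable from them using only ε transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def step(states, letter, nfa):
    """One deterministic step: follow the letter, then close under ε transitions."""
    moved = {t for s in states for t in nfa.get((s, letter), ())}
    return epsilon_closure(moved, nfa)


start = epsilon_closure({"q0"}, NFA)
print(sorted(start))
print(sorted(step(start, "a", NFA)))
```

Plugging step into the subset construction in place of the plain transition rule gives determinization for automata with ε transitions.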

28 Minimization
[Diagram: automaton with states p0, p1, p2, p3, p4 and edges labeled a and b.]
Group all the states together.
Separate states according to available exit transitions.
Separate a set into two if from some of its states one can reach another set and from others one cannot.
Repeat until no set can be separated.

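29 Minimization
[Diagram: automaton with states p0, p1, p2, p3, p4 and edges labeled a and b.]
Group all the states together.

30 Minimization
[Diagram: automaton with states p0, p1, p2, p3, p4 and edges labeled a and b.]
Separate states according to available exit transitions.

31 Minimization
[Diagram: automaton with states p0, p1, p2, p3, p4 and edges labeled a and b.]
Separate a set into two if from some of its states one can reach another set and from others one cannot. Repeat until no set can be separated.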

32 Can minimize now
[Diagram: the minimized automaton, with edges labeled a and b.]
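The separation steps can be sketched as partition refinement. Two caveats: this sketch starts from the usual split into accepting and non-accepting states rather than from a single group, and it assumes a complete deterministic automaton (a transition for every state and letter). The example automaton with states q0 to q3 is hypothetical, since the transitions of the slide's automaton are not recoverable.

```python
def minimize(states, alphabet, delta, accepting):
    """Return the blocks of equivalent states of a complete deterministic automaton."""
    states, accepting = set(states), set(accepting)
    # Assumed starting point: accepting states apart from non-accepting ones.
    partition = [block for block in (states & accepting, states - accepting) if block]
    while True:
        block_of = {s: i for i, block in enumerate(partition) for s in block}
        new_partition = []
        for block in partition:
            # Separate the block by which blocks its states reach on each letter.
            groups = {}
            for s in block:
                signature = tuple(block_of[delta[(s, a)]] for a in alphabet)
                groups.setdefault(signature, set()).add(s)
            new_partition.extend(groups.values())
        if len(new_partition) == len(partition):   # nothing was separated: done
            return {frozenset(block) for block in new_partition}
        partition = new_partition


# Hypothetical four-state automaton for "words ending in b", with redundant states.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q2",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q3",
    ("q3", "a"): "q0", ("q3", "b"): "q3",
}
BLOCKS = minimize({"q0", "q1", "q2", "q3"}, "ab", DELTA, {"q2", "q3"})
print(BLOCKS)
```

Each returned block becomes one state of the minimized automaton; here q0 and q1 merge, as do q2 and q3.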

33 Lex
Declarations
%%
Translation rules
%%
Auxiliary procedures

34 Lex behavior
Lex source program (lex.l) → Lex → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → Output tokens

35 Lex behavior
Translates the definitions into an automaton.
The automaton looks for the longest matching string.
It either returns some value to the reading program (the parser) or looks for the next token.
Lookahead operator: x/y allows the token x only if y follows it (but y is not part of the token).
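The longest-match behavior can be imitated in a few lines of Python. This is not how Lex is implemented (Lex compiles all rules into one automaton); the sketch simply tries every rule at the current position and keeps the longest match, breaking ties in favor of the rule listed first, as Lex does. The rule set and token names are hypothetical.

```python
import re

# Hypothetical rules, in priority order (earlier rules win ties).
RULES = [
    ("if", re.compile(r"if")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("num", re.compile(r"[0-9]+")),
    ("rel", re.compile(r"<=|>=|<>|<|>")),
    ("ws", re.compile(r"[ \t\n]+")),
]


def next_token(text, pos):
    """Return (rule name, lexeme, new position) for the longest match at pos."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer matches replace the current best, so ties keep the earlier rule.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    name, end = best
    return name, text[pos:end], end


def scan(text):
    """Return all tokens of the text, skipping whitespace instead of returning it."""
    pos, tokens = 0, []
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos)
        if name != "ws":
            tokens.append((name, lexeme))
    return tokens


print(scan("if ifx <= 10"))
```

The input ifx shows why the longest match matters: the keyword rule matches only if, while the identifier rule matches all three characters and therefore wins.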

36 Lex Project
Project collection date: Feb 11th.
Work in pairs (singles).
Use lex to take a text and check whether the number of open parentheses of any kind is equal to the number of closed parentheses.
Exception: inside quotes. \” is not a closing quote.

