Published by Cole Kirby. Modified over 2 years ago.

1
Lexical Analysis. Dragon Book: chapter 3.

2
Compiler structure. Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program. Symbol table and error handling are shared by all phases.

3
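Compiler structure. Source program → Lexical analyzer → Syntax analyzer. The syntax analyzer asks the lexical analyzer to "get next token"; the lexical analyzer answers with a token. Both use the symbol table and error handling.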

4
Tokens in programming languages
Token | Sample instances | Description
if | if | keyword
rel | >=, > | relation
id | count, length, point2 | variable
num | 7, 145e-3 | numerical constant
str | abc, some space \7\ is a char | constant string

5
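Tokens may be difficult to recognize. Fortran: DO 5 I=1.25 versus DO 5 I=1,25 (spaces do not count, so the first assigns 1.25 to the variable DO5I and the second opens a loop; the two differ only in the period or comma). PL/I: IF THEN THEN THEN=ELSE; ELSE ELSE=THEN; (no reserved keywords). PL/I: PR1(2, 7, 18, D*3, )=3 (procedure call or array reference).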

6
Strings, languages. String: a sequence of characters over some alphabet, e.g., over {0, 1}. In computers, usually ASCII or EBCDIC. Length of a string: its number of characters. Empty string: ε (length 0). Concatenation: putting one string after another. X=dog, Y=house, XY=doghouse (also written X.Y). Prefix: ban is a prefix of banana. Suffix: ana is a suffix of banana.

7
Language: a set of strings. The alphabet is a language: L={A, B, …, Z, a, b, …, z}. Constant languages: X={ab, ba}, Y={a}. Concatenation: X.Y = {aba, baa}. Y.X = {aab, aba}. Union: X ∪ Y = X+Y = X|Y = {ab, ba, a}. Exponentiation: X^3 = X.X.X. Star: X* = zero or more occurrences. L* = all words with letters from L. L+ = all words with one or more letters from L.

8
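Regular expressions. $X \mid Y = X \cup Y = \{\, s \mid s \in X \text{ or } s \in Y \,\}$. $X.Y = \{\, x.y \mid x \in X \text{ and } y \in Y \,\}$. $X^* = \bigcup_{i=0}^{\infty} X^i$. $X^+ = \bigcup_{i=1}^{\infty} X^i$.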

9
Examples. a|b = {a, b}. (a|b).(a|b) = {aa, ab, ba, bb}. a* = {ε, a, aa, aaa, … }. (a|b)* = {ε, a, b, ab, ba, aa, aba, … }.
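The set operations on these slides can be tried out directly on finite languages. The following is a minimal sketch using Python sets of strings; the helper names `concat`, `power` and `star_up_to` are made up for this illustration, and since X* is infinite the star is only approximated up to a chosen power.

```python
from itertools import product

def concat(X, Y):
    """X.Y = { x + y | x in X and y in Y }."""
    return {x + y for x, y in product(X, Y)}

def power(X, n):
    """X^n = X.X...X (n times); X^0 is {""}, the language holding only the empty string."""
    result = {""}
    for _ in range(n):
        result = concat(result, X)
    return result

def star_up_to(X, max_power):
    """Finite approximation of X*: the union of X^0, X^1, ..., X^max_power."""
    words = set()
    for i in range(max_power + 1):
        words |= power(X, i)
    return words

X = {"ab", "ba"}
Y = {"a"}
print(sorted(concat(X, Y)))                    # ['aba', 'baa']
print(sorted(X | Y))                           # ['a', 'ab', 'ba']
print(sorted(concat({"a", "b"}, {"a", "b"})))  # ['aa', 'ab', 'ba', 'bb']
print(sorted(star_up_to({"a"}, 3)))            # ['', 'a', 'aa', 'aaa']
```

The empty Python string plays the role of the empty string in the examples above.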

10
Defining tokens
digit → [0-9]
digits → digit+
fraction → . digits | ε
exponent → E ( + | - | ε ) digits | ε
const → digits fraction exponent
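As an illustration, the definition of const can be written as a regular expression in Python's `re` syntax. This is a sketch, not part of the slides: the optional parts (the empty alternatives) become `?`, and the names `DIGITS`, `FRACTION`, `EXPONENT` and `is_const` are chosen here for readability.

```python
import re

DIGITS = r"[0-9]+"                    # digits: one or more digit
FRACTION = rf"(?:\.{DIGITS})?"        # fraction: '.' digits, or empty
EXPONENT = rf"(?:E[+-]?{DIGITS})?"    # exponent: 'E', optional sign, digits, or empty
CONST = re.compile(rf"{DIGITS}{FRACTION}{EXPONENT}")

def is_const(text):
    """True when the whole text is a numeric constant under the definition above."""
    return CONST.fullmatch(text) is not None

for sample in ["7", "3.14", "145E-3", "6.02E+23", ".5", "1.", "145e-3"]:
    print(sample, is_const(sample))
```

Because the definition uses an upper-case E only, a lower-case spelling such as 145e-3 is rejected by this sketch; accepting it would need `[Ee]` in the exponent.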

11
Not everything is regular! Examples: all the words of the form w c w, where w is a word and c a letter. The syntax of a program, e.g., the recursive definition of if-then-else: stmt → if expr then stmt else stmt.

12
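Reading the input. Sometimes lookahead is needed. For example, to identify the variable done the scanner has to read past the prefix do before it can rule out the keyword. It may then need to unread a character. [Figure: input buffer holding "if a>8 then goto nextloop else begin while z>8 do", with two pointers: "Token starts here" and "Last character read".]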
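A small sketch of reading with one character of pushback may make the idea concrete. The class `CharReader`, the function `next_word` and the keyword set are hypothetical names invented for this example; a real scanner would usually buffer its input differently.

```python
KEYWORDS = {"if", "then", "else", "begin", "while", "do", "goto"}  # assumed keyword set

class CharReader:
    """Character source with pushback (hypothetical helper for this sketch)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def read(self):
        """Return the next character, or '' at the end of the input."""
        ch = self.text[self.pos:self.pos + 1]
        self.pos += 1
        return ch

    def unread(self):
        """Step back so that the last character read is delivered again."""
        self.pos -= 1

def next_word(reader):
    """Read a maximal run of letters and digits, then decide keyword or identifier."""
    chars = []
    ch = reader.read()
    while ch.isalnum():
        chars.append(ch)
        ch = reader.read()
    reader.unread()  # the character that ended the word belongs to the next token
    word = "".join(chars)
    return ("keyword" if word in KEYWORDS else "id", word)

reader = CharReader("done;")
print(next_word(reader))  # ('id', 'done')
print(reader.read())      # ;
```

The decision between do and done is only taken after the first non-alphanumeric character has been seen, which is why that character has to be pushed back.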

13
Returning: token + attributes. Input: if xyz > 11 then. Returned: (if, keyword); (id, value=xyz); (op, value=>); (const, value=11); (then, keyword).
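The token-plus-attribute idea can be sketched with a small regex-driven tokenizer. Everything below is an assumption made for illustration (the keyword set, the operator list, the tuple layout); it is not the scanner from the slides.

```python
import re

KEYWORDS = {"if", "then", "else"}  # assumed keyword set

# One alternative per token class; leading white space is skipped.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<word>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<const>[0-9]+)"
    r"|(?P<op><=|>=|<>|<|>|=))"
)

def tokenize(text):
    """Return a list of (token, attribute) pairs for the input text."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        if m.group("word") is not None:
            word = m.group("word")
            # Keywords are their own token; other words are identifiers.
            tokens.append((word, "keyword") if word in KEYWORDS else ("id", word))
        elif m.group("const") is not None:
            tokens.append(("const", int(m.group("const"))))
        else:
            tokens.append(("op", m.group("op")))
        pos = m.end()
    return tokens

print(tokenize("if xyz > 11 then"))
# [('if', 'keyword'), ('id', 'xyz'), ('op', '>'), ('const', 11), ('then', 'keyword')]
```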

14
Finite Automata. [Figure: automaton with states s1 to s5 and edges labeled a, b and c.] Includes: States {s1,s2,…,s5}. Initial states {s1}. Accepting states {s3,s5}. Alphabet {a, b, c}. Transitions: {(s1,a,s2), (s2, a, s3), …}. Deterministic?

15
Automaton. What is the language? [Figure: automaton with states s0 and s1 and edges labeled a and b.] Formally: An input is a word over the alphabet. A run over a word is an alternating sequence of states and letters, starting from the initial state. Accepting run: ends with an accepting state.

16
Example. [Figure: the automaton with states s0 and s1.] Input: aabbb. Run: s0 a s0 a s0 b s1 b s1 b s1. Accepts. Input: aba. Run: s0 a s0 b s1 a s0. Does not accept.
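The two runs can be replayed with a few lines of code. The transition table below is an assumption: it is inferred from the runs shown on this slide (the figure itself is not available), with s0 as the initial state and s1 as the only accepting state.

```python
# Transition table inferred from the two runs on the slide (an assumption).
DELTA = {
    ("s0", "a"): "s0",
    ("s0", "b"): "s1",
    ("s1", "a"): "s0",
    ("s1", "b"): "s1",
}
INITIAL = "s0"
ACCEPTING = {"s1"}

def run(word):
    """Return the sequence of states visited while reading word."""
    states = [INITIAL]
    for letter in word:
        states.append(DELTA[(states[-1], letter)])
    return states

def accepts(word):
    """A deterministic automaton accepts when its single run ends in an accepting state."""
    return run(word)[-1] in ACCEPTING

print(run("aabbb"), accepts("aabbb"))  # ['s0', 's0', 's0', 's1', 's1', 's1'] True
print(run("aba"), accepts("aba"))      # ['s0', 's0', 's1', 's0'] False
```

Under this assumed table the automaton accepts exactly the words that end in b.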

17
Automaton. What is the language? [Figure: automaton with states s0 and s1 and edges labeled a, a, b, b.]

18
Automaton. What is the language? [Figure: automaton with states s1 and s0 and edges labeled a, a, b, b.]

19
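Identifying tokens. [Figure: automaton whose branches spell the keywords IF, THEN and ELSE one letter per edge, plus a branch for identifiers: an edge labeled letter followed by a loop labeled letter|digit.]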

20
Non deterministic automata. Allows more than a single transition from a state with the same label. There does not have to be a transition from every state with every label. Allows multiple initial states. Allows ε transitions. [Figure: automaton with states s0, s1, s2, s3 and edges labeled 0,1 and 1.]

21
Nondeterministic runs. Input: 0100. Run 1: s0 0 s0 1 s0 0 s0 0 s0. Does not accept. Run 2: s0 0 s0 1 s1 0 s2 0 s3. Accepts. The automaton accepts when there exists an accepting run. [Figure: the automaton with states s0, s1, s2, s3.]
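"There exists an accepting run" can be checked by enumerating the runs. The transition relation below is an assumption reconstructed from the two runs and the edge labels on the slide (s0 loops on 0 and 1, moves to s1 on 1, and s1, s2 advance on either letter); s3 is taken as the only accepting state.

```python
# Assumed transition relation: each (state, letter) maps to a set of successors.
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}
ACCEPTING = {"s3"}

def runs(word, state="s0"):
    """Yield every complete run over word, as a list of visited states."""
    if not word:
        yield [state]
        return
    for nxt in sorted(NFA.get((state, word[0]), ())):
        for rest in runs(word[1:], nxt):
            yield [state] + rest

def accepts(word):
    """Accept when at least one complete run ends in an accepting state."""
    return any(r[-1] in ACCEPTING for r in runs(word))

for r in runs("0100"):
    print(r, "accepts" if r[-1] in ACCEPTING else "does not accept")
print(accepts("0100"))  # True
```

Enumerating runs can take exponential time in general; it is used here only because it mirrors the definition directly.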

22
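Determinizing Automata. Given a nondeterministic automaton N, build a deterministic automaton D. Each state of D is a set of states of N. $S \xrightarrow{a} T$ when $T = \{\, t \mid s \in S \text{ and } s \xrightarrow{a} t \,\}$. The initial state of D is the set of all the initial states of N. The accepting states of D are the sets that include at least one accepting state of N. [Figure: the automaton N with states s0, s1, s2, s3.]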
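The rules above translate almost line by line into a worklist procedure. The sketch below assumes the same four-state automaton as in the preceding slides (its transitions are reconstructed from the runs shown there, not copied from the figure), and the function name `determinize` is chosen for this example.

```python
from collections import deque

# Assumed N: s0 loops on 0 and 1, moves to s1 on 1; s1 and s2 advance on either letter.
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}

def determinize(nfa, initial_states, accepting, alphabet):
    """Subset construction: every state of D is a frozenset of states of N."""
    start = frozenset(initial_states)
    dfa = {}
    seen = {start}
    queue = deque([start])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # T = { t | s in S and s --a--> t }
            T = frozenset(t for s in S for t in nfa.get((s, a), ()))
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    dfa_accepting = {S for S in seen if S & accepting}
    return start, dfa, seen, dfa_accepting

start, dfa, states, accepting = determinize(NFA, {"s0"}, {"s3"}, "01")
for S in sorted(states, key=sorted):
    print(sorted(S), "accepting" if S in accepting else "")
```

Only the subsets reachable from the initial state are generated, so D can be much smaller than the full power set.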

23
Determinization. [Figure: N (states s0 to s3, edge labels 0,1 and 1) and the resulting D with states s0; s0,s3; s0,s2; s0,s1,s3; s0,s2,s3; s0,s1,s2,s3; s0,s1,s2; s0,s]

24
Determinization

25
Translating regular expressions into automata. [Figure: automata for L1, L2 and L, and the automata built from them for L1 ∪ L2, L1.L2 and L*.]

26
Automatic translation. (a|b).(a|b) = (a ∪ b)(a ∪ b) = (a+b).(a+b) = … [Figure: automata for a and b, combined step by step.]

27
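Determinization with ε transitions. Add to each set the states reachable using ε transitions. [Figure: automaton with states s0 to s11; edges labeled a between s1 and s3 and between s7 and s9, edges labeled b between s2 and s4 and between s8 and s10. Resulting deterministic states: {s0,s1,s2}, {s3,s5,s6,s7,s8}, {s4,s5,s6,s7,s8}, {s9,s11}, {s10,s11}, connected by edges labeled a and b.]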
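Adding the states reachable by ε transitions is usually called taking the ε-closure. The sketch below assumes a plausible set of ε edges for the twelve-state automaton on this slide (the figure is not available, so the ε edges are a guess chosen to reproduce the state sets listed on the slide); the empty string is used as the ε label.

```python
EPS = ""  # label used in this sketch for ε transitions

# Assumed automaton for (a|b).(a|b); the ε edges are a reconstruction.
NFA = {
    ("s0", EPS): {"s1", "s2"},
    ("s1", "a"): {"s3"},
    ("s2", "b"): {"s4"},
    ("s3", EPS): {"s5"},
    ("s4", EPS): {"s5"},
    ("s5", EPS): {"s6"},
    ("s6", EPS): {"s7", "s8"},
    ("s7", "a"): {"s9"},
    ("s8", "b"): {"s10"},
    ("s9", EPS): {"s11"},
    ("s10", EPS): {"s11"},
}

def eps_closure(states):
    """All states reachable from the given states using only ε transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def step(S, a):
    """One move of the determinized automaton: follow a, then close under ε."""
    return eps_closure({t for s in S for t in NFA.get((s, a), ())})

start = eps_closure({"s0"})
print(sorted(start))                       # ['s0', 's1', 's2']
print(sorted(step(start, "a")))            # ['s3', 's5', 's6', 's7', 's8']
print(sorted(step(step(start, "a"), "b"))) # ['s10', 's11']
```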

28
Minimization. Step 1: group all the states together. Step 2: separate states according to their available exit transitions. Step 3: split a set in two if, on some letter, some of its states lead into another set and the others do not. Repeat until no set can be split. [Figure: automaton with states p0 to p4 and edges labeled a and b.]

29
Minimization. Group all the states together: {p0, p1, p2, p3, p4}. [Figure: automaton with states p0 to p4 and edges labeled a and b.]

30
Minimization. Separate states according to their available exit transitions. [Figure: automaton with states p0 to p4 and edges labeled a and b.]

31
Minimization. Split a set in two if, on some letter, some of its states lead into another set and the others do not. Repeat until no set can be split. [Figure: automaton with states p0 to p4 and edges labeled a and b.]

32
Can minimize now. [Figure: the minimized automaton, with edges labeled a and b.]
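The splitting procedure can be sketched as partition refinement. Two points below are assumptions rather than content of the slides: the five-state automaton (p0 branches to p1 and p2, which both lead to p3 and p4) is a guess consistent with the edge labels in the figure, and the initial split is by accepting versus non-accepting states, the usual starting point, whereas the slides start from a single group. Missing transitions are treated as leading nowhere.

```python
# Assumed automaton: three edges labeled a, three labeled b; p3 and p4 accept.
STATES = {"p0", "p1", "p2", "p3", "p4"}
ALPHABET = "ab"
DELTA = {
    ("p0", "a"): "p1", ("p0", "b"): "p2",
    ("p1", "a"): "p3", ("p1", "b"): "p4",
    ("p2", "a"): "p3", ("p2", "b"): "p4",
}
ACCEPTING = {"p3", "p4"}

def minimize(states, alphabet, delta, accepting):
    """Return the final partition of states into groups of equivalent states."""
    # Initial split: accepting versus non-accepting (deviation from the slides).
    partition = [b for b in (frozenset(accepting), frozenset(states - accepting)) if b]
    while True:
        block_of = {s: i for i, block in enumerate(partition) for s in block}
        refined = []
        for block in partition:
            groups = {}
            for s in block:
                # Signature: for each letter, the group reached (None if no exit transition).
                sig = tuple(block_of.get(delta.get((s, a))) for a in alphabet)
                groups.setdefault(sig, set()).add(s)
            refined.extend(frozenset(g) for g in groups.values())
        if len(refined) == len(partition):  # nothing was split: done
            return set(refined)
        partition = refined

for group in sorted(minimize(STATES, ALPHABET, DELTA, ACCEPTING), key=sorted):
    print(sorted(group))
```

Each group of the final partition becomes one state of the minimized automaton.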

33
Lex
Declarations
%%
Translation rules
%%
Auxiliary procedures

34
Lex behavior. Lex source program (lex.l) → Lex → lex.yy.c → C compiler → a.out. Input stream → a.out → Output tokens.

35
Lex behavior. Translates the definitions into an automaton. The automaton looks for the longest matching string. The action then either returns some value to the reading program (the parser) or goes on to look for the next token. Lookahead operator: x/y allows the token x only if y follows it (but y is not part of the token).
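The longest-match rule and the lookahead operator can be imitated in a few lines. This is a sketch in Python, not Lex: the rule list and the helper names `next_token`, `tokenize` and `trailing_context` are invented for the example, ties are resolved in favor of the earlier rule, and the x/y operator is approximated with a regex lookahead.

```python
import re

# Hypothetical rule list, in priority order (earlier rules win ties).
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS", re.compile(r"\s+")),
]

def next_token(text, pos):
    """Try every rule at pos and keep the longest match."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    name, m = best
    return name, m.group(), m.end()

def tokenize(text):
    """Return (rule name, matched text) pairs, skipping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos)
        if name != "WS":
            tokens.append((name, lexeme))
    return tokens

def trailing_context(x, y):
    """Approximate x/y: match x only when y follows; y stays in the input."""
    return re.compile(rf"(?:{x})(?={y})")

print(tokenize("if ifx 42"))  # [('IF', 'if'), ('ID', 'ifx'), ('NUM', '42')]
print(trailing_context("ab", "c").match("abc").group())  # ab
```

Here ifx becomes a single identifier because the longer match wins, while a bare if goes to the earlier rule. Unlike this sketch, a generated scanner runs one combined automaton over the input instead of trying the rules one at a time.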

36
Lex Project. Project collection date: Feb 11th. Work in pairs (singles). Use lex to take a text and check whether the number of open parentheses of any kind is equal to the number of closed parentheses. Exception: parentheses inside quotes are not counted. \" is not a closing quote.
