COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.


1 COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow

2 Compiler Front End
Lexical analysis
–Identifies tokens
Parser
–Determines the derivation
–Builds a parse tree, which is reduced to an abstract syntax tree to eliminate uninteresting branches
See fig. 2.2

3 From Text to Tokens: Roadmap
1. Reading program text
2. Lexical vs syntactic analysis
3. Regular expressions & descriptions
4. Lexical analysis
5. Creating lexical analyzers
6. Symbol handling

4 Reading Program Text
Reading individual characters
–Simple and attractive, but very inefficient
Buffered input
–How big a buffer?
–Unlimited line lengths complicate fixed-size buffers
Read the whole program into a buffer
–Practical and efficient on modern machines

5 Handling Lines
Different conventions identify line ends
–\n
–\r\n
–Fixed-length records
–Variable-length records
A portable compiler solves the problem by converting to a standard form as soon as possible
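The "convert to a standard form as soon as possible" step can be sketched as a tiny normalization pass; the helper name `normalize_newlines` is hypothetical, not from the text:

```python
def normalize_newlines(text: str) -> str:
    # Convert Windows (\r\n) and bare-\r line endings to the
    # single standard form \n before any other processing.
    # Replacing \r\n first prevents it from becoming \n\n.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

After this pass, the rest of the compiler can assume every line ends in a single \n.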

6 Lexical vs Syntactic Analysis
Syntactic analysis could, in principle, be used for everything
Separate lexical-analysis routines are
–Simpler
–Faster
Also, comments and white space are not part of the syntax

7 Regular Expressions
A regular expression describes a set of strings
Rules for constructing REs: see fig. 2.4
–Escape special symbols by enclosing them in double quotes: "+?", and """" for the quote character itself

8 Regular Descriptions
An EBNF grammar with one additional rule:
–No non-terminal may be used before it has been defined
Each rule can be expanded to the form L → RE
–This form is called the token description for the token L

9 Lexical Analysis by Hand
Generally simple to write
–Finite-state-machine model
–Walk through the input characters
–Use a switch statement or conditionals to classify cases

10 Example Tokens
–Identifier: letter (letter|digit)* (_ (letter|digit)+)*
–Integer: digit+
–Operators: + | - | * | /
–Separators: : | , | ( | ) | { | }
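A hand-written scanner for roughly these tokens can be sketched as a single walk over the input, classifying each token by its first character; this is a simplified version (the underscore-group rule for identifiers is collapsed into "letter, then letters/digits/underscores"), not the book's code:

```python
def tokenize(text):
    """Hand-written scanner: one pass, first character decides the case."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():                       # white space: skipped, not a token
            i += 1
        elif ch.isalpha():                     # identifier (simplified rule)
            j = i
            while j < n and (text[j].isalnum() or text[j] == "_"):
                j += 1
            tokens.append(("IDENT", text[i:j]))
            i = j
        elif ch.isdigit():                     # integer: digit+
            j = i
            while j < n and text[j].isdigit():
                j += 1
            tokens.append(("INT", text[i:j]))
            i = j
        elif ch in "+-*/":                     # operators
            tokens.append(("OP", ch)); i += 1
        elif ch in ":,(){}":                   # separators
            tokens.append(("SEP", ch)); i += 1
        else:                                  # anything else: error token
            tokens.append(("ERROR", ch)); i += 1
    return tokens
```

The final `else` branch anticipates the error-handling slide below: any unclassifiable character becomes a one-character error token, so the scanner never gets stuck.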

11 Example Code See figs. 2.5 – 2.13

12 Precomputation
Identify character classes
–Create a table with a one entry if the character value is in the class and a zero otherwise
–Digit has 1's in positions 48-57
–Letter has 1's in positions 65-90 and 97-122
–A single table look-up then checks a character (see fig. 2.14)
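The precomputed tables can be sketched directly from the positions on the slide; this is an illustration, with table names of my choosing:

```python
# One 256-entry table per character class: 1 if the character code
# is in the class, 0 otherwise.
is_digit = [0] * 256
for c in range(48, 58):                       # codes for '0'..'9'
    is_digit[c] = 1

is_letter = [0] * 256
for c in list(range(65, 91)) + list(range(97, 123)):  # 'A'..'Z', 'a'..'z'
    is_letter[c] = 1
```

Testing a character is then a single indexed load, with no comparisons or branches.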

13 Table Compression
By using a different bit for each character class, we can store 8 tables in the space of one
–Do one lookup, then AND the result with the desired character-class bit (see fig. 2.15)
Alternatively, a bit map could be used instead of a byte array
–Trades space for the extra work of selecting a bit versus ANDing
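The bit-per-class scheme can be sketched as one shared table whose entries are bit masks; the class names and the `in_class` helper are illustrative, not from the text:

```python
import string

DIGIT, LETTER, WHITE = 1, 2, 4            # one bit per character class
char_class = [0] * 256
for c in "0123456789":
    char_class[ord(c)] |= DIGIT
for c in string.ascii_letters:
    char_class[ord(c)] |= LETTER
for c in " \t\n\r":
    char_class[ord(c)] |= WHITE

def in_class(ch, mask):
    # One table lookup, then AND with the desired class bit(s).
    return (char_class[ord(ch)] & mask) != 0
```

Because each class occupies its own bit, a single mask can also test membership in a union of classes (e.g. `DIGIT | LETTER`).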

14 Automatic Generation of a Lexical Analyzer: Roadmap
1. Dotted items
2. Concurrent search
3. Precomputing item sets
4. The final lexical analyzer
5. Table compression
6. Error handling

15 Naïve Analysis
Match the regular expressions directly
–Try each token type from the starting point and take the longest match (see fig. 2.16)
Better: classify by the first symbol to limit the work
Best: check all possibilities in parallel

16 Dotted Items
Usually used for parsing, but applicable to scanning as well
T → α • β
–Interpretation: α has already been matched; β remains to be matched
–Represents a hypothesis about the presence of a token T (see figs. 2.17, 2.18)
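A dotted item can be sketched as a triple (token name, pattern, dot position); this toy version restricts patterns to plain character strings, with no repetition or alternation operators, so only character moves and reduce items exist:

```python
# A dotted item T -> alpha . beta as (name, pattern, dot):
# pattern[:dot] is alpha (already matched), pattern[dot:] is beta.
def char_move(item, ch):
    """If the next input character matches the character after the
    dot, return the item with the dot advanced; otherwise None."""
    name, pat, dot = item
    if dot < len(pat) and pat[dot] == ch:
        return (name, pat, dot + 1)
    return None

def is_reduce(item):
    """Dot at the end of the pattern: the whole token has been matched."""
    name, pat, dot = item
    return dot == len(pat)
```

Non-basic items (dot in front of a repetition or parenthesized subexpression) would additionally need the ε moves of the following slides.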

17 Item Classification
Basic items
–Shift item: dot in front of a basic pattern
–Reduce item: dot at the end
Non-basic items
–Dot in front of a repetition subexpression
–Dot in front of a parenthesized subexpression

18 Moves
Character move
–The next input character matches the character at the start of β
ε move
–Move the dot over an optional part of the pattern (see fig. 2.19)
Sample run on the input 3.1 for the pattern [0-9]* '.' [0-9]+ (see p. 75)

19 Concurrent Search
Concurrently move through all applicable token definitions
–Example with integral_number [0-9]+ and fixed_point_number as before (see p. 77)
The goal is an algorithm that processes each character once, except for back-up

20 LexAn Generator
Run-time routine for scanning (see fig. 2.21)
Three functions:
–Initial item set() (fig. 2.22): generates the initial item set
–Next item set(Item set, ch) (fig. 2.23): generates the next item set by moving over ch
–Class of token recognized(Item set) (fig. 2.24): returns the class of the token found

21 Improved Scanner
Precompute the item sets
This yields a finite state automaton (FSA) model
–The initial item set describes the start state
–Next item set describes the transitions
Build the FSA (see figs. 2.26, 2.27, 2.28)
Final analyzer (see fig. 2.29)
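Once the item sets are precomputed, the scanner reduces to a table-driven loop. A minimal sketch for the earlier fixed-point pattern [0-9]* '.' [0-9]+, with the states worked out by hand here where a generator would derive them from the item sets (state names and the `accepts` driver are my own):

```python
def char_class(ch):
    # Map each character to its class, as in the precomputation slides.
    return "d" if ch.isdigit() else ch

# Transition table for [0-9]* '.' [0-9]+
# state 0: before the dot   state 1: after the dot, no digit yet
# state 2: accepting (at least one digit after the dot)
TRANSITIONS = {
    (0, "d"): 0, (0, "."): 1,
    (1, "d"): 2,
    (2, "d"): 2,
}

def accepts(s):
    """Table-driven driver: one lookup per character, no back-up needed
    when only whole-string acceptance is asked for."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:            # no transition: reject
            return False
    return state == 2
```

The driver never re-examines a character; all the regular-expression reasoning has been baked into the table ahead of time.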

22 Table Compression
The transition table is sparse and possibly large
–Various methods can be used to compress it
–(details skipped)

23 Error Handling
A simple method is to add, at the end, a rule that recognizes any single character
–Make this the error token class
–The scanner will then skip forward to the next valid restart point

24 Lex
Lex: the standard lexical-analyzer generator on Unix/Linux
Flex is the GNU equivalent (see fig. 2.41 for a sample)

