
Compiler Structures
2. Lexical Analysis
241-437, Semester 1, 2011-2012

Objective:
– what is lexical analysis?
– look at a lexical analyzer for a simple 'expressions' language

Overview
1. Why Lexical Analysis?
2. Using a Lexical Analyzer
3. Implementing a Lexical Analyzer
4. Regular Expressions (REs)
5. The Expressions Language
6. exprTokens.c
7. From REs to Code Automatically

In this Lecture
The compiler pipeline; this lecture covers the lexical analyzer at the start of the front end:
  Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Int. Code Generator (front end)
  → Intermediate Code → Code Optimizer → Target Code Generator (back end) → Target Lang. Prog.

1. Why Lexical Analysis?
A stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants). This:
– simplifies the design of the rest of the compiler: the code uses tokens, not strings or characters
– can be implemented efficiently, by hand or automatically
– improves portability: non-standard symbols / foreign characters are translated here, so they do not affect the rest of the compiler

2. Using a Lexical Analyzer
The syntax analyzer and the lexical analyzer cooperate:
1. the syntax analyzer (which works with tokens) asks for the next token
2. the lexical analyzer (which works with chars) gets chars from the source program to make a token
3. the token and its token value are returned to the syntax analyzer
The lexical analyzer reports lexical errors; the syntax analyzer reports syntax errors.

A Source Program is Chars
Consider the program fragment:
  if (i==j);
    z=1;
  else;
    z=0;
  endif;
The lexical analyzer reads it in as a string of characters (_ marks a space):
  if_(i==j);\n\tz=1; \nelse; \tz=0;\nendif;
Lexical analysis divides the string into tokens.

Tokens and Token Values
For example, for the input "y = 31 + 28*foo": the lexical analyzer gets chars from the input, and the syntax analyzer gets tokens from the lexical analyzer (one at a time), each a (token, token value) pair.

Tokens, Lexemes, and Patterns
A token is a lexical type, e.g. id, int.
A lexeme is a token value, e.g. "abc", 123.
A pattern says how to make a token from chars, e.g.:
  id  = letter followed by letters and digits
  int = non-empty sequence of digits
A pattern is defined using regular expressions (REs).

3. Implementing a Lexical Analyzer
Issues:
– lookahead: how to group chars into tokens
– ignoring whitespace and comments
– separating variables from keywords, e.g. "if", "else"
– (automatically) translating REs into a lexical analyzer

Lookahead
A token is created by reading in characters and grouping them together.
It is not always possible to decide if a token is finished without looking ahead at the next char. For example:
– is "i" a variable, or the first character of "if"?
– is "=" an assignment or the beginning of "=="?

4. Regular Expressions (REs)
REs are an algebraic way of specifying how to recognise input; 'algebraic' means that the recognition pattern is defined using RE operands and operators.
Covered in more detail in 240-304 "Maths for CoE".

4.1. REs in grep
grep searches its input (e.g. from a file) a line at a time. If the line contains a string that matches grep's RE (pattern), then the line is output (e.g. to a file):
  grep "RE" < inputFile > outputFile
The examples below use an input file containing the lines:
  hello andy
  my name is andy
  my
  bye
  byhe

Examples
  grep "and"
outputs the matching lines:
  hello andy
  my name is andy
"|" means "or":
  grep -E "an|my"
outputs:
  hello andy
  my name is andy
  my

"*" means "0 or more":
  grep "hel*"
outputs:
  hello andy
  byhe

4.2. The RE Language
A RE defines a pattern which recognises (matches) a set of strings, e.g. a RE can be defined that recognises the strings {aa, aba, abba, abbba, abbbba, ...}.
These recognisable strings are sometimes called the RE's language.

RE Operands
There are 4 basic kinds of operands:
– characters (e.g. 'a', '1', '(')
– the symbol ε (means an empty string '')
– the symbol {} (means the empty set)
– variables, which can be assigned a RE: variable = RE

RE Operators
There are three basic operators:
– union: '|'
– concatenation
– closure: '*'

Union
S | T: this RE can use the S or T RE to match strings.
Example REs:
  a | b      matches the strings {a, b}
  a | b | c  matches the strings {a, b, c}

Concatenation
S T: this RE will use the S RE followed by the T RE to match against strings.
Example REs:
  a b        matches the string {ab}
  w | (a b)  matches the strings {w, ab}

What strings are matched by the RE (a | ab)(c | bc)?
It is equivalent to {a, ab} followed by {c, bc}:
  => {ac, abc, abc, abbc}
  => {ac, abc, abbc}

Closure
S*: this RE can use the S RE 0 or more times to match against strings.
Example RE:
  a*  matches the strings {ε, a, aa, aaa, aaaa, aaaaa, ...}    (ε is the empty string)

4.3. REs for C Identifiers
We define two RE variables, letter and digit:
  letter = A | B | C | D | ... | Z | a | b | c | d | ... | z
  digit  = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
id is defined using letter and digit:
  id = letter ( letter | digit )*

Strings matched by id include:
  ab345   wh5g
Strings not matched:
  2$abc   ****

4.4. RE Summary
  Expression   Meaning
  ε            Empty pattern
  a            Any pattern represented by 'a'
  ab           Strings with pattern 'a' followed by 'b'
  a|b          Strings consisting of pattern 'a' or 'b'
  a*           Zero or more occurrences of patterns in 'a'
  a+           One or more occurrences of patterns in 'a'
  a^3          Patterns in 'a' repeated exactly 3 times
  a?           (a | ε); optional single pattern from 'a'
  .            Any single character

More Operators
See the regular expressions "cheat-sheet" at the course website in the "Useful Info" subdirectory: over 80 operators!

Wild Card Symbol: '.'
The '.' stands for any character except the newline, e.g.:
  grep 'a..b.$' chapter1.txt
  grep 't.*t.*t' /usr/share/dict/words    (the UNIX/Linux 'dictionary')

  grep "a..b." /usr/share/dict/words
The dictionary file starts "A, A's, AOL, AOL's, ..."; matching lines include:
  adobe
  alibi
  ameba

4.5. REs for Integers and Floats
We redefine digit:
  digit = 0|1|2|3|4|5|6|7|8|9    or    digit = [0-9]
int and float:
  int   = {digit}+
  float = {digit}+ "." {digit}+

Integers and floats with exponents:
  number = {digit}+ ('.' {digit}+)? ('E' ('+'|'-')? {digit}+)?

4.6. More on REs
See the RE summary on the course website: regular_expressions_cheat_sheet.pdf
I have the standard RE book:
  Mastering Regular Expressions
  Jeffrey E. F. Friedl
  O'Reilly & Associates

There are many websites that explain REs:
  http://etext.lib.virginia.edu/services/helpsheets/unix/regex.html
  http://www.zytrax.com/tech/web/regex.htm
  http://www.regular-expressions.info

5. The Expressions Language
In my expressions language, a program is a series of expressions and assignments. Example:
  // test2.txt example
  let x56 = 2
  let bing_BONG = (27 * 2) - x56
  5 * (67 / 3)

5.1. REs for the Language
  alpha    = a | b | c | ... | z | A | B | ... | Z
  digit    = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  alphanum = alpha | digit
  id       = alpha (alphanum | '_')*
  int      = digit+

  keywords    = "let" | "SCANEOF"
  punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'
Ignore:
– whitespace (but not newlines)
– comments ("//" to the end of the line)

5.2. From REs to Tokens
Using the REs as a guide, we create tokens and token values. How? In general, the top-level REs (id, num) become tokens, and so do the punctuation and the keywords.

Tokens and Token Values
  Token     Token Value
  ID        "var" and the id string
  INT       "num" and the value
  LPAREN    '('
  RPAREN    ')'
  PLUSOP    '+'
  MINUSOP   '-'
  MULTOP    '*'
  DIVOP     '/'

  Token     Token Value
  ASSIGNOP  '='
  NEWLINE   '\n'
  LET       "let"
  SCANEOF   eof character

6. exprTokens.c
exprTokens.c is a lexical analyzer for the expressions language. It reads in an expressions program on stdin, and prints out the tokens (and their values).

6.1. Usage
  > gcc -Wall -o exprTokens exprTokens.c
  > ./exprTokens < test2.txt
   1:
   2:
   3:
   4: 'let' var(x56) '=' num(2)
   5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
   6:
   7: num(5) '*' '(' num(67) '/' num(3) ')'
   8: 'eof'
Or use a Windows C compiler, e.g. lcc-win32: http://www.cs.virginia.edu/~lcc-win32/

6.2. Code
  // constants for tokens and their values
  #define NUMKEYS 2

  typedef enum token_types {
    LET, ID, INT, LPAREN, RPAREN, NEWLINE,
    ASSIGNOP, PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF
  } Token;

  char *tokSyms[] = {"let", "var", "num", "(", ")", "\n",
                     "=", "+", "-", "*", "/", "eof"};

  char *keywords[NUMKEYS] = {"let", "SCANEOF"};
  Token keywordToks[NUMKEYS] = {LET, SCANEOF};

Callgraph for exprTokens.c
(figure: main() calls nextToken() and printToken(); nextToken() calls scanner())

main() and its Globals
  Token currToken;
  int lineNum = 1;    // num lines read in

  int main(void)
  {
    printf("%2d: ", lineNum);
    do {
      nextToken();
      printToken();
    } while (currToken != SCANEOF);
    return 0;
  }

Printing the Tokens
  #define MAX_IDLEN 30
  char tokString[MAX_IDLEN];
  int currTokValue;    // used when token is an integer

  void printToken(void)
  {
    if (currToken == ID)            // an ID, variable name
      printf("%s(%s) ", tokSyms[currToken], tokString);
    else if (currToken == INT)      // a number; show value
      printf("%s(%d) ", tokSyms[currToken], currTokValue);
    else if (currToken == NEWLINE)  // print newline token
      printf("%s%2d: ", tokSyms[currToken], lineNum);
    else                            // other toks
      printf("'%s' ", tokSyms[currToken]);
  }  // end of printToken()

Getting a Token
  void nextToken(void)
  {
    currToken = scanner();
  }

scanner() Overview
  Token scanner(void)    // converts chars into a token
  {
    int inCh;
    clearTokStr();
    if (feof(stdin))
      return SCANEOF;
    while ((inCh = getchar()) != EOF) {    /* EOF is ^D */
      if (inCh == '\n') {
        lineNum++;
        return NEWLINE;
      }
      else if (isspace(inCh))    // do nothing
        continue;

      else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
        // read in chars to make id token
        // return ID or keyword
      }
      else if (isdigit(inCh)) {    // INT = DIGIT+
        // read in chars to make int token
        // change token to int
        return INT;
      }
      else if (inCh == '(')        // punctuation
        return LPAREN;
      else if ...                  // more tests of inCh ...
      else if (inCh == '=')
        return ASSIGNOP;
      else
        lexicalErr(inCh);
    }
    return SCANEOF;
  }  // end of scanner()

Processing an ID
In scanner():
      else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
        extendTokStr(inCh);
        for (inCh = getchar(); (isalnum(inCh) || inCh == '_'); inCh = getchar())
          extendTokStr(inCh);
        ungetc(inCh, stdin);
        return checkKeyword();
      }

Token String Functions
  void clearTokStr(void)
  // reset the token string to be empty
  {
    tokString[0] = '\0';
    tokStrLen = 0;
  }  // end of clearTokStr()

  void extendTokStr(char ch)
  // add ch to the end of the token string
  {
    if (tokStrLen == (MAX_IDLEN-1))
      printf("Token string too long for %c\n", ch);
    else {
      tokString[tokStrLen] = ch;
      tokStrLen++;
      tokString[tokStrLen] = '\0';    // terminate string
    }
  }  // end of extendTokStr()

Checking for a Keyword
  Token checkKeyword(void)
  {
    int i;
    for (i = 0; i < NUMKEYS; i++) {
      if (!strcmp(tokString, keywords[i]))
        return keywordToks[i];
    }
    return ID;
  }  // end of checkKeyword()

Processing an INT
In scanner():
      else if (isdigit(inCh)) {    // INT = DIGIT+
        extendTokStr(inCh);
        for (inCh = getchar(); isdigit(inCh); inCh = getchar())
          extendTokStr(inCh);
        ungetc(inCh, stdin);
        currTokValue = atoi(tokString);    // token --> int
        return INT;
      }

Reporting an Error
  void lexicalErr(char ch)
  {
    printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
    exit(1);
  }
No recovery is attempted.

6.3. Some Good News
Most programming languages use very similar lexical analyzers, e.g. the same kinds of IDs, INTs, punctuation, and keywords.
Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.

7. From REs to Code Automatically
1. Write the REs for the language.
2. Convert the REs to a Non-deterministic Finite Automaton (NFA).
3. Convert the NFA to a Deterministic Finite Automaton (DFA).
4. Convert the DFA to a table that can be 'plugged' into an 'empty' lexical analyzer.
There are tools that will do stages 2-4 automatically. We'll look at one such tool, lex, in the next chapter.

