Presentation is loading. Please wait.

Presentation is loading. Please wait.

Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.

Similar presentations


Presentation on theme: "Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine."— Presentation transcript:

1 Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine (yylex) that recognizes the tokens (and performs corresponding actions).

2 Lex source program {definition} % {rules} % {user subroutines} Rules: Each regular expression specifies a token. Default action for anything that is not matched: copy to the output Action: C/C++ code fragment specifying what to do when a token is recognized.

3 lex program examples: ex1.l and ex2.l –‘lex ex1.l’ produces the lex.yy.c file that contains a routine yylex(). The int yylex() routine is the scanner that finds all the regular expressions specified. –yylex() returns a non-zero value (usually token id) normally. –yylex() returns 0 when end of file is reached. –Need a drive to test the routine. Main.cpp is an example. You need to have a yywrap() function in the lex file (return 1), see the function in ex1.l. –Something to do with compiling multiple files.

4 Lex regular expression: contains text characters and operators. –Letters of alphabet and digits are always text characters. Regular expression integer matches the string “integer” –Operators: “\[]^-?.*+|()$/{}%<> When these characters happen in a regular expression, they have special meanings –Lex regular expressions cannot have space in them!!!

5 –operators (characters that have special meanings): “\[]^-?.*+|()$/{}%<> ‘*’, ‘+’, ‘|’, ‘(‘,’)’ -- used in regular expression ‘ “ ‘ -- any character in between quote is a text character. –E.g.: “xyz++” == xyz”++” ‘\’ -- escape character, –To get the operators back: “xyz++” == ?? –To specify special characters: \40 == “ “ ‘[‘ and ‘]’ -- used to specify a set of characters –e.g: [a-z], [a-zA-Z], –Every character in it except ^, - and \ is a text character – [-+0-9], [\40-\176] ‘^’ -- not, used as the first character after the left bracket –E.g [^abc] – any character except ‘a’, ‘b’ or ‘c’. –[^a-zA-Z] -- ??

6 –operators (characters that have special meanings): “\[]^-?.*+|()$/{}%<> ‘.’ -- every character ‘?’ -- optional ab?c matches ‘ac’ or ‘abc’ ‘/’ -- used in character lookahead: –e.g. ab/cd -- matches ab only if it is followed by cd ‘{‘’}’ -- enclose a regular definition ‘%’ -- has special meaning in lex ‘$’ -- match the end of a line, ‘^’ -- match the beginning of a line –ab$ == ab/\n ‘ ’: start condition (more context sensitivity support, see the paper for details).

7 –Order of pattern matching: Always matches the longest pattern. When multiple patterns matches, use the first pattern. –To override, add “REJECT” in the action.... % Ab {printf(“rule 1\n”);} Abc {printf(“rule 2\n”);} {letter}{letter|digit}* {printf(“rule 3\n”);} % Input: Abc What happened when at ‘.*’ as a pattern? Should regular expressions for reserved words happen before or after the regular expression for identifier?

8 –Manipulate the lexeme and/or the input stream: yytext -- a char pointer pointing to the matched C string yyleng -- the length of the matched string I/O routines to manipulate the input stream: –yyinput() -- get a character from the input character, return <=0 when reaching the end of the input stream, the character otherwise »yyinput() for c++, input() for c. –unput( c ) -- put c back onto the input stream –Deal with comments: (/* ….. */ »Why is pattern “/*”.*”*/” a problem? % … “/*” {char c1; c2 = yyinput(); if (c2 <=0) {lex_error(“unfinished comment” …} else { c1 = c2; c2 = yyinput(); while (((c1!=‘*’) || (c2 != ‘/’)) && (c2 > 0)) {c1 = c2; c2 = yyinput();} if (c2 <= 0) {lex_error( ….) }

9 –Reporting errors: What kind of errors? Not too many. –Characters that cannot lead to a token –unended comments (can we do it in later phases?) –unended string constants.

10 –Reporting errors: How to keep track of current position (which line, which column)? –Use to global variable for this: yyline, yycolumn %{ int yyline = 1, yycolumn = 1; %}... % “\n” {yyline++; yycolumn = 0;} [ \t]+ {/* do nothing*/ yycolumn += yyleng;} If {yycolumn += yyleng; return (IFNumber);} “+” {yycolumn += 1; return (PLUSNumber);} {letter}{letter|digit}* {yylval = idtable_insert(yytext); yycolumn += yyleng; return(IDNumber);}... %

11 Dealing with identifiers, string constants. –Data structures: Put the lexeme in a string table: e.g. vector of C strings. See token.l –Recognizing constant strings with special characters Assuming string cannot pass line boundary. Use yymore() “[^”\n]* {char c; c = yyinput(); if (c != ‘”’) error else if (yytext[yyleng-1] == ‘\\’) { unput( c ); yymore(); } else {/* find the whole string, normal process*/}

12 Put it all together Checkout token.l and main.cpp in lex2.tar


Download ppt "Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine."

Similar presentations


Ads by Google