1
Scanning & Parsing with Lex and YACC
Assignment 2 statistics:
  Submissions: 99
  Average for A2: 71%
  Early submission bonus: 1
  Full marks: 5
  16 teams attempted the nonce bonus; 7 got full marks
  7 teams attempted the ACC bonus
Can we generate code to support mundane coding tasks and save time?
Lex and YACC: powerful, but not easy
Today: an example for Milestone 1.
Hans-Arno Jacobsen, ECE 297
2
CoursePeer – try it out!
Developed by a former ECE297 student
Many of the videos under tips & tricks are from him too
Short video about CoursePeer
To sign up and auto-enrol under ECE297, use this link
We will have a quick demo and use it on Wednesday for our Q&A session
3
Know your tools!
Can we generate code based on a specification of what we want?
Is the specification simpler than writing a program for doing the same task?
Fully automated program generation has been a dream since the early days of computing.
4
Where do we need parsing in the storage server?
  Configuration file (file)
  Bulk loading of data files (file)
  Protocol messages (network)
  Command line arguments (string)
6
Parsing
default.conf – the way the disk may see it:
  server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF
Structure: PROPERTY VALUE (TABLE TABLE-NAME)+
  PROPERTY VALUE pairs: server_host localhost, server_port 1111
Tokens: server_host localhost server_port 1111 table marks data_directory ./data
7
Scenarios: where would we like to save time by writing a quick language processor?

  Conceptually speaking (languages)   In our storage servers (languages)
  Data description language           Data schema & data
  Script language                     Query language
  Markup language                     Output formatting (Web, LaTeX, PDF, Word, Excel)
  System configurations               Storage server configuration
  Workload generation                 Benchmarking
8
Parser generation from 30K feet
[Figure: the developer writes the specification and the other code. The generator turns the specification into generated code. The compiler / linker combines the generated code and the other code into the executable.]
9
Scanning & parsing I
Input: server_host localhost \n server_port 1111 \n table marks \n # This data …
Scanning: turns the input into tokens, PROPERTY VALUE PROPERTY VALUE …
Parsing: checks the token structure, PROPERTY VALUE (TABLE TABLE-NAME)+
Processing: verify content, add to data structures, …
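The scanning step above can be sketched by hand in C, to show what a generated scanner does for us. This is only an illustration under stated assumptions: the token codes and the function name `next_token` are hypothetical (in the real project Bison generates the token codes and flex generates `yylex()`), and comments are recognized only when `#` starts a word.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes; in the real project Bison generates them. */
enum token { TOK_EOF = 0, TOK_PROPERTY, TOK_TABLE, TOK_VALUE };

/* Copy the next whitespace-delimited word from *cursor into buf and
 * classify it. Text from a leading '#' to the end of the line is skipped. */
enum token next_token(const char **cursor, char *buf, size_t buflen)
{
    const char *p = *cursor;
    size_t n = 0;

    assert(buflen > 0);
    for (;;) {
        while (isspace((unsigned char)*p)) p++;    /* skip blanks and newlines */
        if (*p != '#') break;
        while (*p != '\0' && *p != '\n') p++;      /* skip comment to end of line */
    }
    if (*p == '\0') {                              /* nothing left: end of input */
        *cursor = p;
        buf[0] = '\0';
        return TOK_EOF;
    }
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < buflen) buf[n++] = *p;         /* truncate overlong words */
        p++;
    }
    buf[n] = '\0';
    *cursor = p;

    if (strcmp(buf, "table") == 0) return TOK_TABLE;
    if (strcmp(buf, "server_host") == 0 || strcmp(buf, "server_port") == 0 ||
        strcmp(buf, "data_directory") == 0) return TOK_PROPERTY;
    return TOK_VALUE;
}
```

Calling `next_token` repeatedly on the text of default.conf yields the PROPERTY VALUE … stream shown on the slide; every line of this function is code that a flex specification lets us avoid writing.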
10
Regular expressions
Patterns: (TABLE TABLE-NAME)+ matches
  TABLE TABLE-NAME
  TABLE TABLE-NAME TABLE TABLE-NAME
  …
Regular expressions (formal languages)
Extended regular expressions (UNIX)
11
Scanning & parsing II
Parsing is really two steps:
  Scanning (a.k.a. tokenizing or lexical analysis)
  Parsing, i.e., analysis of structure and syntax according to a grammar (i.e., a set of rules)
flex is the scanner generator (open source)
  Fast Lex, for lexical analysis
YACC is the parser generator
  Yet Another Compiler Compiler, for structural and syntax analysis
Lex and YACC work together
  The generated parser drives the generated scanner: yyparse() calls yylex() whenever it needs the next token
We use flex (fast Lex) and Bison (GNU YACC)
There is a myriad of other tools for Java, C++, …, some of which combine Lex/Yacc into one tool (e.g., javacc)
12
Objectives for today
Cover the basics of Lex & Yacc
Everybody should have an appreciation of the potential of these tools
There is a lot more detail that remains unsaid
To challenge you
13
Lex & YACC overview
Input stream: server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF
Lexical Analyzer: input stream -> token stream (PROPERTY VALUE PROPERTY VALUE …)
Structural Analyzer: token stream -> output defined by actions in parser specification (often an in-memory representation of input)
14
Lexical Analysis with Lex
15
Lex introduction
Synonyms: lexical analyzer, scanner, lexer, tokenizer
flex is fast Lex
You generate the lexical analyzer by using flex:
  Input specification (*.l) -> flex -> lex.yy.c -> C compiler -> Lexical Analyzer
  The lexical analyzer turns the input stream into a token stream
You can control the name of the generated file (lex.yy.c by default)
16
Lex: input specification for lex – the “program”
Three parts: Definitions, Rules, User code
Use “%%” as a delimiter for each part
First part: Definitions
  Options used by flex inside the scanner
  Defines variables & macros
  Code within “%{” and “%}” is copied directly into the scanner (e.g., global variables, header files)
Second part: Rules
  Patterns and corresponding actions
  Actions are executed when the corresponding pattern(s) match
  Patterns are defined by regular expressions
17
Parsing the configuration file of Milestone 1
config_parser.l

Definitions (shorthands for use below):
%{
#include "config_parser.tab.h"
...
%}
a2Z   [a-zA-Z]
host  server_host
port  server_port
dir   data_directory
%%

Rules (pattern, then action):
{host}     { return HOST_PROPERTY; }
{port}     { return PORT_PROPERTY; }
table      { return TABLE; }
{dir}      { return DDIR_PROPERTY; }
[\t\n ]+   { }
#.*\n      { }
{a2Z}*     { yylval.sval = strdup(yytext); return STRING; }
[0-9]+     { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
.          { return yytext[0]; }
…
18
flex pattern matching principles
Actions are executed when patterns match
  Tokens are returned to caller; next pattern …
Patterns match a given input character or string only once
  Input stream is consumed
flex executes the action for the longest possible matching input
Order of patterns in the spec. is important: if two patterns match the same longest input, the pattern listed first wins
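The two selection principles (longest match, then order in the specification) can be sketched in plain C. This is an illustration, not how flex is implemented: flex compiles all patterns into one automaton, whereas this sketch simply tries hypothetical matcher functions one by one. The names `matcher`, `pick_rule` and the three `match_*` helpers are assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule reports how many leading characters of s it matches (0 = no match). */
typedef size_t (*matcher)(const char *s);

size_t match_table_keyword(const char *s)   /* pattern: table */
{
    return strncmp(s, "table", 5) == 0 ? 5 : 0;
}

size_t match_letters(const char *s)         /* pattern: [a-zA-Z]+ */
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n])) n++;
    return n;
}

size_t match_digits(const char *s)          /* pattern: [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Return the index of the winning rule, or -1 if no rule matches.
 * The longest match wins; on a tie the rule listed first wins. */
int pick_rule(const matcher *rules, int nrules, const char *s, size_t *len)
{
    int best = -1;
    size_t best_len = 0;

    assert(rules != NULL && s != NULL && len != NULL);
    for (int i = 0; i < nrules; i++) {
        size_t n = rules[i](s);
        if (n > best_len) {      /* strictly longer, so earlier rules keep ties */
            best = i;
            best_len = n;
        }
    }
    *len = best_len;
    return best;
}
```

With the rules listed as keyword, letters, digits, the input "table marks" selects the keyword rule (both rules match five characters, the first one listed wins), while "tables" selects the letters rule because its match is longer.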
19
Regular expressions
Concise description of a character string
Used widely in tools (editors, text retrieval, …)
Main operators:
  A | B        matches A or B
  A (A | B)    matches A followed by A or B
  A*           0 or more occurrences of A
  A?           0 or 1 occurrence of A
  A+           1 or more occurrences of A
Note the flex syntax on the next slides.
20
flex regular expressions by example I (really: extended regular expressions)
  x            match the character 'x'
  .            any character (byte) except newline
  [xyz]        match either an 'x', a 'y', or a 'z'
  [abj-oZ]     match an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
  [^A-Z]       a "negated character class", i.e., any character EXCEPT those in the class
  [^A-Z\n]     any character EXCEPT an uppercase letter or a newline
21
flex regular expressions by example II
(r is any regular expression)
  r*           zero or more r's
  r+           one or more r's
  r?           zero or one r (that is, "an optional r")
  r{2,5}       anywhere from two to five r's
  r{2,}        two or more r's
  r{4}         exactly 4 r's
  <<EOF>>      an end-of-file
22
flex regular expressions
There are many more expressions, see the manual
Form complex expressions
  E.g.: IP address, names, …
The expression syntax is used in other tools as well (well worth learning)
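Because the same expression syntax appears in other tools, a pattern can be tried out in C with the POSIX `regcomp`/`regexec` functions from `<regex.h>` before it goes into a flex specification. The sketch below is an illustration under stated assumptions: the helper name `ere_matches` and the macro `DOTTED_QUAD_ERE` are made up for this example, the pattern only checks the shape of an IPv4 address (it does not check that each number is at most 255), and flex syntax differs from POSIX syntax in some details.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Shape of an IPv4 address: four groups of 1 to 3 digits separated by dots. */
#define DOTTED_QUAD_ERE "^[0-9]{1,3}(\\.[0-9]{1,3}){3}$"

/* Return 1 if text matches the extended regular expression pattern,
 * 0 if it does not, and -1 if the pattern fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);   /* 0 means the text matched */
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, `ere_matches(DOTTED_QUAD_ERE, "127.0.0.1")` reports a match, while "localhost" and "127.0.0" are rejected.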
23
Parsing the configuration file of Milestone 1
config_parser.l

%{
#include "config_parser.tab.h"
...
%}
a2Z   [a-zA-Z]
host  server_host
port  server_port
dir   data_directory
%%
{host}     { return HOST_PROPERTY; }
{port}     { return PORT_PROPERTY; }
table      { return TABLE; }
{dir}      { return DDIR_PROPERTY; }
[\t\n ]+   { }
#.*\n      { }
{a2Z}*     { yylval.sval = strdup(yytext); return STRING; }
[0-9]+     { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
.          { return yytext[0]; }
<<EOF>>    { return 0; }

yylval is a user-defined variable in YACC (it conveys the token value to YACC)
Example input: server_host localhost server_port 1111 table marks data_directory ./data
24
Parsing with Yacc
25
YACC introduction
From the specified grammar, YACC generates a parser which recognizes “sentences” according to the grammar:
  Input specification (*.y) -> YACC -> y.tab.c -> C compiler -> Syntax analyzer / parser
  The parser reads a token stream (e.g., via flex); its output is defined by the actions in the parser specification
You can control the name of the generated file (y.tab.c by default)
26
YACC: input specification for YACC (similar to flex)
Three parts: Definitions, Rules, User code
Use “%%” as a delimiter for each part
First part: Definitions
  Definition of tokens for the second part and for use by flex
  Definition of variables for use by the parser code
Second part: Rules
  Grammar for the parser
Third part: User code
  The code in this part is copied into the parser generated by YACC
27
Configuration file parser Milestone 1
Definition section (config_parser.y)

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* define a linked list of table names */
struct table {
    char *table_name;
    struct table *next;
};

/* define a structure for the configuration information */
struct configuration {
    char *host;
    int port;
    struct table *tlist;
    char *data_dir;
};

struct table *tl, *t;
struct configuration *c;
28
Configuration file parser Milestone 1
Definition section cont’d. (config_parser.y)

%}
%union{
    char *sval;   // String value (user defined)
    int pval;     // Port number value (user defined)
}
%token <sval> STRING
%token <pval> PORT_NUMBER
%token HOST_PROPERTY PORT_PROPERTY DDIR_PROPERTY TABLE
%%
29
Configuration file parser Milestone 1
(Grammar) Rules section, simplified (config_parser.y)

property_list: HOST_PROPERTY STRING PORT_PROPERTY PORT_NUMBER table_list data_directory
    ;
table_list: table_list TABLE STRING
    | TABLE STRING
    ;
data_directory: DDIR_PROPERTY STRING
    ;
%%
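To make concrete which token sequences the simplified grammar accepts, here is a hand-written check in C. It is only a sketch: Bison generates a table-driven parser, not code like this, and the token enum and the function name `is_property_list` are hypothetical stand-ins for the generated token codes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes mirroring the %token names in the grammar. */
enum token { END = 0, HOST_PROPERTY, PORT_PROPERTY, DDIR_PROPERTY,
             TABLE, STRING, PORT_NUMBER };

/* Return 1 if toks (terminated by END) has the shape
 *   HOST_PROPERTY STRING PORT_PROPERTY PORT_NUMBER
 *   (TABLE STRING)+ DDIR_PROPERTY STRING
 * and 0 otherwise. */
int is_property_list(const enum token *toks)
{
    size_t i = 0;
    int tables = 0;

    assert(toks != NULL);
    if (toks[i++] != HOST_PROPERTY) return 0;
    if (toks[i++] != STRING)        return 0;
    if (toks[i++] != PORT_PROPERTY) return 0;
    if (toks[i++] != PORT_NUMBER)   return 0;
    while (toks[i] == TABLE) {               /* table_list: one or more entries */
        if (toks[i + 1] != STRING) return 0;
        i += 2;
        tables++;
    }
    if (tables == 0) return 0;               /* the grammar requires a table */
    if (toks[i++] != DDIR_PROPERTY) return 0;
    if (toks[i++] != STRING)        return 0;
    return toks[i] == END;                   /* nothing may follow */
}
```

Changing the accepted language here means editing control flow by hand, whereas in the YACC specification it means editing one grammar rule.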
30
(Grammar) Rules section
Details (config_parser.y)

struct configuration {
    char *host;
    int port;
    struct table *tlist;
    char *data_dir;
};
struct configuration *c;

/* $1 is DDIR_PROPERTY, $2 is STRING */
data_directory: DDIR_PROPERTY STRING
    {
        c = (struct configuration *) malloc(sizeof(struct configuration));
        // Check c for NULL
        c->data_dir = strdup( $2 );
    }
    ;
31
(Grammar) Rules section
Details (config_parser.y)

struct configuration {
    char *host;
    int port;
    struct table *tlist;
    char *data_dir;
};
struct configuration *c;

property_list: HOST_PROPERTY STRING PORT_PROPERTY PORT_NUMBER table_list data_directory
    {
        c->host = strdup( $2 );
        c->port = $4;
        c->tlist = tl;
    }
    ;
32
Configuration file parser Milestone 1
(Grammar) Rules section, simplified (config_parser.y)

property_list: HOST_PROPERTY STRING PORT_PROPERTY PORT_NUMBER table_list data_directory
    ;
table_list: table_list TABLE STRING
    | TABLE STRING
    ;
data_directory: DDIR_PROPERTY STRING
    ;
%%

table_list matches … TABLE STRING TABLE STRING
33
table_list is a recursive rule
Example table specification in configuration file:
  table MyCourses
  table MyMarks
  table MyFriends

table_list: table_list TABLE STRING
    | TABLE STRING
    ;

Terminology:
  table_list is called a non-terminal
  TABLE & STRING are terminals
34
Recursive rule execution
table_list: table_list TABLE STRING
    | TABLE STRING
    ;

Input: table MyCourses table MyMarks table MyFriends
  table MyCourses: TABLE STRING is reduced to table_list
  table MyMarks: table_list TABLE STRING is reduced to table_list
  table MyFriends: table_list TABLE STRING is reduced to table_list
35
table_list rule with actions (config_parser.y)

struct table *tl, *t;
struct table {
    char *table_name;
    struct table *next;
};

/* First alternative: $1 is table_list, $2 is TABLE, $3 is STRING.
   Second alternative: $1 is TABLE, $2 is STRING. */
table_list: table_list TABLE STRING
    {
        t = (struct table *) malloc(sizeof(struct table));
        t->table_name = strdup( $3 );
        t->next = tl;    /* new entry points to the current list */
        tl = t;          /* new entry becomes the head */
    }
    | TABLE STRING
    {
        tl = (struct table *) malloc(sizeof(struct table));
        tl->table_name = strdup( $2 );
        tl->next = NULL; /* first entry ends the list */
    }
    ;
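The list building done by the two actions can be tried outside the parser. The sketch below uses the same struct table, but the helper names `table_push`, `table_free` and `copy_string` are hypothetical; `copy_string` stands in for strdup so the example needs only the standard C library. It shows one consequence of prepending: the list ends up in reverse order of the configuration file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct table {
    char *table_name;
    struct table *next;
};

/* Stand-in for strdup(): return a heap copy of s, or NULL on failure. */
char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, s, n);
    return copy;
}

/* Prepend a copy of name to the list, as the recursive rule's action does.
 * Returns the new head, or NULL if allocation fails (list left untouched). */
struct table *table_push(struct table *head, const char *name)
{
    struct table *t;

    assert(name != NULL);
    t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->table_name = copy_string(name);
    if (t->table_name == NULL) { free(t); return NULL; }
    t->next = head;
    return t;
}

/* Release every node and its name. */
void table_free(struct table *head)
{
    while (head != NULL) {
        struct table *next = head->next;
        free(head->table_name);
        free(head);
        head = next;
    }
}
```

Pushing MyCourses, MyMarks and MyFriends in file order leaves MyFriends at the head of the list, so code that needs file order has to reverse the list or append instead.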
36
How to invoke the parser

int main (int argc, char **argv)
{
    FILE *f;
    extern FILE *yyin;

    if (argc == 2) {
        f = fopen(argv[1], "r");
        if (!f) {
            … // error handling …
        }
        yyin = f;
        while ( !feof(yyin) ) {
            if (yyparse() != 0) {
                …
                yyerror("");
                exit(1);   /* non-zero exit status signals the parse error */
            }
        }
        fclose(f);
    }
    return 0;
}

yylex() calls the generated scanner; by default it is called within yyparse()
37
In the Makefile

lexer: config_parser.l
	${LEX} config_parser.l
	${CC} ${CFLAGS} ${INCLUDE} -c lex.yy.c

yaccer: config_parser.y
	${YACC} -d config_parser.y
	${CC} ${CFLAGS} ${INCLUDE} -c config_parser.tab.c

parser: config_parser.tab.o lex.yy.o
	${CC} ${CFLAGS} ${INCLUDE} -c parser.c
	${CC} -o p ${CFLAGS} ${INCLUDE} lex.yy.o \
		config_parser.tab.o \
		parser.o
38
Benefits
Faster development
  Compared to manual implementation
Easier to change the specification and generate a new parser
  Than to modify 1000s of lines of code to add, change, or delete a feature
Less error-prone, as code is generated
Cost: learning curve
  Invest once, amortized over a 40+ year career
39
If you want to know more Lecture, examples and some recommended reading are enough to tackle all of the parsing for Milestone 3 & 4 3rd and 4th year lectures on Compilers may show you the algorithms behind & inside Lex & YACC Lectures on Computability and Theory of Computation may also show you these algorithms
41
A flex specification

The header:
%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%

The “guts”: regular expressions annotated with actions
" "           ;
[a-z]         { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]         { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\b]   { c = yytext[0]; return(c); }
42
The header

%{
#include <stdio.h>
#include "y.tab.h"
int c;               /* temporary variable(s) */
extern int yylval;   /* special variable defined in scanner, used in parser for
                        transferring values associated with tokens to parser */
%}
%%                   /* dividing line between header and rules section */
43
The rules

%%
" "           ;
[a-z]         { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]         { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\b]   { c = yytext[0]; return(c); }

yytext: the string associated with the token
44
The rules

%%
" "           ;
[a-z]         { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]         { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\b]   { c = yytext[0]; return(c); }

[a-z]: sets yylval to the character’s alphabetical order
[0-9]: sets yylval to the digit’s numerical value
[^a-z0-9\b]: otherwise simply returns that character; presumably it’s an operator: +*-, etc.
45
Simple example
Implement a calculator which can recognize adding or subtracting of numbers

[linux33]% ./y_calc
1+101
= 102
[linux33]% ./y_calc
= 1000
[linux33]%
46
Example – the Lex part

Definitions:
%{
#include <math.h>
#include <stdlib.h>   /* atoi */
#include "y.tab.h"
extern int yylval;
%}
%%

Rules (pattern, then action):
[0-9]+    { yylval = atoi(yytext); return NUMBER; }
[\t ]+    ;                  /* Do nothing for white space */
\n        return 0;          /* End of the logic */
.         return yytext[0];
47
Example – the Yacc part

Definitions:
%token NAME NUMBER
%%

Rules:
statement:  NAME '=' expression
    | expression              { printf("= %d\n", $1); }
    ;
expression: expression '+' NUMBER   { $$ = $1 + $3; }
    | expression '-' NUMBER   { $$ = $1 - $3; }
    | NUMBER                  { $$ = $1; }
    ;

Include the Yacc library when linking (-ly)
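The expression rule is left-recursive, so the actions combine numbers from left to right. A hand-written C equivalent, as a sketch only, makes that order visible; the function names `read_number` and `eval_expression` are hypothetical, the NAME '=' expression alternative is left out, and integer overflow is not checked.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skip blanks, read one unsigned decimal number into *out, skip blanks again.
 * Returns the position after the number, or NULL if no digit is found. */
const char *read_number(const char *s, int *out)
{
    int v = 0;

    while (*s == ' ' || *s == '\t') s++;
    if (!isdigit((unsigned char)*s)) return NULL;
    while (isdigit((unsigned char)*s)) v = v * 10 + (*s++ - '0');
    while (*s == ' ' || *s == '\t') s++;
    *out = v;
    return s;
}

/* Evaluate NUMBER (('+' | '-') NUMBER)* from left to right.
 * Returns 1 and stores the value in *result, or 0 on a syntax error. */
int eval_expression(const char *s, int *result)
{
    int acc, rhs;

    assert(s != NULL && result != NULL);
    s = read_number(s, &acc);                /* expression: NUMBER { $$ = $1; } */
    if (s == NULL) return 0;
    while (*s == '+' || *s == '-') {
        char op = *s++;
        s = read_number(s, &rhs);
        if (s == NULL) return 0;
        acc = (op == '+') ? acc + rhs : acc - rhs;   /* $$ = $1 + $3 or $1 - $3 */
    }
    if (*s != '\0' && *s != '\n') return 0;  /* trailing input is an error */
    *result = acc;
    return 1;
}
```

For the input 1+101 this yields 102, as in the sample session, and 10-3-2 yields 5 because subtraction is applied left to right.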