
Lexical and Syntactic Analysis

Presentation on theme: "Lexical and Syntactic Analysis"— Presentation transcript:

1 Lexical and Syntactic Analysis
We look at two of the tasks involved in the compilation process:
- break the source code into lexical units, or lexemes (lexical analysis); the goal is to identify each lexeme and assign it to its proper token
- parse the lexical components into their syntactic uses (syntactic analysis); the goal is to parse the lexemes into a parse tree
During both lexical and syntactic analysis, errors can be detected and reported

2 Continued
Lexical analysis takes source code that consists of reserved words, identifiers, punctuation, blank spaces, and comments, and identifies the lexical category of each item
- e.g., the reserved word "for", a semicolon, an identifier, a close parenthesis ), etc.
How do we perform this operation?
- use a relatively simple state transition diagram to describe the various entities of interest
- implement a program based on recursive functions, one per lexical category, to examine the next portion of the program and identify a component's category
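As a hypothetical illustration (not taken from the slides), a lexical analyzer might break the statement result = old_sum + 42; into lexemes and tokens as follows; the token names are illustrative, not the slides' own:

Lexeme     Token
result     IDENT
=          ASSIGN_OP
old_sum    IDENT
+          ADD_OP
42         INT_LIT
;          SEMICOLON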

3 Example

4 Recognizing Names/Words/Numbers

5 Implementation
int lex( ) {
  getChar( );
  switch (charClass) {
    case LETTER:
      addChar( );
      getChar( );
      while (charClass == LETTER || charClass == DIGIT) {
        addChar( );
        getChar( );
      }
      return lookup(lexeme);
      break;
    case DIGIT:
      addChar( );
      getChar( );
      while (charClass == DIGIT) {
        addChar( );
        getChar( );
      }
      return INT_LIT;
      break;
  } /* End of switch */
} /* End of function lex */
The code above utilizes three functions:
getChar( ) - gets the next character from the input (file), puts it in the variable nextChar, determines the class of the character, and puts the class in the variable charClass
addChar( ) - appends the character in nextChar to the place where the lexeme is being accumulated
lookup(lexeme) - determines whether the string lexeme is a reserved word and returns a code: 0 if the lexeme should be treated as an identifier, or a positive int value if the lexeme is a reserved word
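The helper functions themselves are not shown on the slide. A minimal sketch of how they might be written is given below; the buffer name lexeme, the class codes, and the reserved-word table are assumptions for illustration only.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define LETTER    0          /* assumed character-class codes */
#define DIGIT     1
#define UNKNOWN   99
#define EOF_CLASS (-1)

static FILE *in_fp;          /* source file, assumed to be opened elsewhere */
static char lexeme[100];     /* the lexeme accumulated so far               */
static int  lexLen = 0;      /* current length of lexeme                    */
static int  nextChar;        /* most recently read character                */
static int  charClass;       /* class of nextChar                           */

/* Read the next character and classify it as LETTER, DIGIT, or UNKNOWN */
void getChar(void) {
    nextChar = getc(in_fp);
    if (nextChar == EOF)         charClass = EOF_CLASS;
    else if (isalpha(nextChar))  charClass = LETTER;
    else if (isdigit(nextChar))  charClass = DIGIT;
    else                         charClass = UNKNOWN;
}

/* Append nextChar to the lexeme being accumulated */
void addChar(void) {
    if (lexLen < (int)sizeof(lexeme) - 1) {
        lexeme[lexLen++] = (char)nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* Return 0 for an identifier, or a positive code for a reserved word */
int lookup(const char *lex) {
    static const char *reserved[] = { "for", "if", "else", "while" };
    for (int i = 0; i < 4; i++)
        if (strcmp(lex, reserved[i]) == 0)
            return i + 1;    /* positive code: reserved word        */
    return 0;                /* 0: treat the lexeme as an identifier */
}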

6 Parsing
The process of generating a parse tree from a set of input
- identifying the grammatical categories of each element of the input
- identifying if and where errors occur
Parsing is similar whether for a natural language or a programming language
- a good parser will continue parsing even after errors have been found; this requires a recovery process
Parsing is based on the language's grammar, but must also include the use of attributes in the grammar (that is, an attribute grammar)

7 Forms of Parsers
Top-down (used in the LL parser algorithm)
- start with LHS rules and map to RHS rules until terminal symbols have been identified, then match these against the input
Bottom-up (used in the LR parser algorithms)
- start with RHS rules and the input, collapsing terminals and non-terminals into non-terminals until you have reached the starting non-terminal
Parsing is an O(n^3) problem, where n is the number of items in the input
- if we cannot determine a single token for each lexeme, the problem becomes O(2^n)!
- by restricting our parser to work only on the grammar of the given language, we can reduce the complexity to O(n)

8 Top-Down Parsing
Uses an LL parser (left-to-right scan of the input, leftmost derivation)
Generate a recursive-descent parser from a BNF grammar
- non-terminal grammatical categories are converted into functions, e.g., <expr>, <if>, <factor>, <assign>, <id>
- each function, when called, obtains the next lexeme using a function called lex( ) and either matches it against terminal symbols or calls further functions
Two restrictions on the grammar
- it cannot have left recursion: if a rule has recursive parts, those parts must not be the first items on the RHS of a rule; for instance <A> → <A>b cannot be allowed, but <A> → b<A> can (an example of removing left recursion follows below)
- it must pass the pairwise disjointness test (covered shortly)
Algorithms exist to alter a grammar so that it meets both restrictions
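As a brief worked example (not shown on the slide), the standard transformation for removing immediate left recursion rewrites a rule of the form <A> → <A>b | c into an equivalent right-recursive pair:

<A> → c<A'>
<A'> → b<A'> | ε

Both grammars generate the strings c, cb, cbb, and so on, but only the second can be handled by a recursive-descent parser: a function for the original <A> would call itself before consuming any input and never terminate.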

9 Recursive Descent Parser Example
Recall our example expression grammar from chapter 3:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | ( <expr> )

void expr( ) {
  term( );
  while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
    lex( );
    term( );
  }
}

void term( ) {
  factor( );
  while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
    lex( );
    factor( );
  }
}

void factor( ) {
  if (nextToken == ID_CODE)
    lex( );
  else if (nextToken == LEFT_PAREN_CODE) {
    lex( );
    expr( );
    if (nextToken == RIGHT_PAREN_CODE)
      lex( );
    else
      error( );
  }
  else
    error( );
}

Consider the while statement:
<while> → while (<bool_expr>) <stmt>; | while (<bool_expr>) { <stmt_list> }

void whileStmt( ) {   /* renamed from "while", which is a reserved word in C */
  if (nextToken != WHILE_CODE)
    error( );
  else {
    lex( );
    if (nextToken != LEFT_PAREN_CODE)
      error( );
    else {
      lex( );
      boolExpr( );
      if (nextToken != RIGHT_PAREN_CODE)
        error( );
      else {
        lex( );
        stmtList( );
      }
    }
  }
}
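A minimal sketch of how these routines might be driven is shown below; the main( ) wrapper and EOF_CODE are illustrative assumptions, not part of the slides, and lex( ) is assumed to deposit each token code in the global nextToken, as the parsing routines above expect.

#include <stdio.h>

/* Declarations for the routines shown above */
extern int nextToken;    /* current token code, maintained by lex( ) */
int  lex(void);
void expr(void);
void error(void);
#define EOF_CODE (-1)    /* assumed code for end of input */

int main(void) {
    lex( );              /* prime the lookahead so nextToken is valid     */
    expr( );             /* parse one <expr>; on return, nextToken holds  */
                         /* the first token after the expression          */
    if (nextToken == EOF_CODE)
        printf("expression parsed with no errors\n");
    else
        error( );        /* leftover input after the expression           */
    return 0;
}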

10 If Statement Example
We expect an if statement to look like this:
if (boolean expr) statement;
optionally followed by:
else statement;
Otherwise, we return an error

void ifstmt( ) {
  if (nextToken != IF_CODE)
    error( );
  else {
    lex( );
    if (nextToken != LEFT_PAREN_CODE)
      error( );
    else {
      lex( );
      boolexpr( );
      if (nextToken != RIGHT_PAREN_CODE)
        error( );
      else {
        lex( );
        statement( );
        if (nextToken == ELSE_CODE) {
          lex( );
          statement( );
        }
      }
    }
  }
}

What would a C for-loop parser look like? Consider the general form:
for (init; condition; incr) stmt;
We can write the init function as follows (a sketch of the enclosing for-loop routine follows this slide):

void init( ) {
  lex( );
  if (assign( ))
    lex( );
  else
    error( );
  while (nextToken == COMMA_CODE)
    init( );
  if (nextToken != SEMICOLON_CODE)
    error( );
}
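The slide leaves the rest of the for-loop parser as a question. One possible sketch of the enclosing routine is given below; the helper names condition( ) and incr( ) are hypothetical, and the sketch assumes each clause routine leaves nextToken on the token that follows its clause.

/* Hypothetical for-loop parser for: for (init; condition; incr) stmt; */
void forStmt( ) {
    if (nextToken != FOR_CODE)
        error( );
    else {
        lex( );
        if (nextToken != LEFT_PAREN_CODE)
            error( );
        else {
            init( );        /* parses the initialization clause            */
            condition( );   /* hypothetical: parses the test expression    */
            incr( );        /* hypothetical: parses the increment clause   */
            if (nextToken != RIGHT_PAREN_CODE)
                error( );
            else {
                lex( );
                statement( );   /* the loop body */
            }
        }
    }
}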

11 Pairwise Disjointness
Consider a rule with multiple RHS parts, for instance <A> → a<B> | a<C>
For the LL parser to be able to select which part of the rule to apply, the first terminal symbol that each right-hand side can begin with must differ (that is, each RHS must start uniquely, so that one token of lookahead makes the choice obvious)
This is pairwise disjointness
Here are some examples
A → aB | bAb | c – passes (pairwise disjoint)
A → aB | aAb – fails (not pairwise disjoint)
<var> → id | id[<expr>] – fails, but can be made pairwise disjoint as follows (a parser for the rewritten rule is sketched below)
<var> → id<next>
<next> → ε | [<expr>] (ε means the empty string)
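A minimal recursive-descent sketch for the rewritten rule, in the style of the earlier routines; it relies on lex( ), expr( ), error( ), and nextToken from the previous slides, and the token codes ID_CODE, LEFT_BRACKET_CODE, and RIGHT_BRACKET_CODE are assumed names.

/* <var> -> id <next>                 */
/* <next> -> epsilon | [ <expr> ]     */
void var( ) {
    if (nextToken == ID_CODE) {
        lex( );
        next( );
    }
    else
        error( );
}

void next( ) {
    if (nextToken == LEFT_BRACKET_CODE) {   /* choose the [ <expr> ] alternative */
        lex( );
        expr( );
        if (nextToken == RIGHT_BRACKET_CODE)
            lex( );
        else
            error( );
    }
    /* otherwise choose epsilon: consume nothing and return */
}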

12 Bottom-Up Parsing
To avoid the restrictions on an LL parser, we might want to use an LR parser (left-to-right scan of the input, rightmost derivation in reverse)
Implemented using a pushdown automaton
- a stack added to the state diagrams seen earlier
The parser has two basic operations
- shift: move items from the input onto the stack
- reduce: take consecutive stack items and reduce them; for instance, if we have a rule <A> → a<B> and we have a and <B> on the stack, reduce them to <A>
The parser is easy to implement, but we must first construct what is known as an LR parsing table
- there are numerous algorithms to generate the parsing table

13 Parser Algorithm
Given input S0, a1, …, an, $
- S0 is the start state
- a1, …, an are the lexemes (token codes) that make up the program
- $ is a special end-of-input symbol
Let Sm be the state on top of the stack and ai the current input symbol:
If action[Sm, ai] = Shift S, then push ai and S onto the stack and change state to S
If action[Sm, ai] = Reduce R, then use rule R of the grammar to reduce the items on top of the stack appropriately; the new state is GOTO[S', L], where S' is the state uncovered after popping the rule's right-hand side and L is the left-hand side of rule R
If action[Sm, ai] = Accept, then the parse is complete with no errors
If action[Sm, ai] = Error (or the entry in the table is blank), then call an error-handling and recovery routine
The parsing table stores the values of action[x, y] and GOTO[x, y] (a sketch of the table-driven loop is given below)
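A minimal sketch of the table-driven loop this describes, assuming the parsing table has already been built; the table lookups, stack layout, and names below are illustrative assumptions, not from the slides.

#include <stdio.h>

/* One parsing-table entry: what to do for (state, input symbol). */
typedef enum { SHIFT, REDUCE, ACCEPT, ERROR } ActionKind;
typedef struct { ActionKind kind; int value; } Action;   /* value = target state or rule number */
typedef struct { int lhs; int rhsLen; } Rule;             /* rule: LHS symbol code and RHS length */

extern Action action(int state, int terminal);    /* assumed: lookup in the action table */
extern int    go_to(int state, int nonterminal);  /* assumed: lookup in the GOTO table   */
extern Rule   rules[];                            /* assumed: the numbered grammar rules */

#define MAX_STACK 1000

/* Returns 1 if input (terminal codes ending with the $ code) is accepted. */
int lrParse(const int *input) {
    int stack[MAX_STACK];   /* holds the states; the symbols are implied by the states */
    int top = 0;
    stack[0] = 0;           /* start in state S0 = 0       */
    int i = 0;              /* index of current input item */

    for (;;) {
        Action a = action(stack[top], input[i]);
        switch (a.kind) {
        case SHIFT:                    /* push the new state and advance the input */
            stack[++top] = a.value;
            i++;
            break;
        case REDUCE: {                 /* pop the rule's RHS, then consult GOTO */
            Rule r = rules[a.value];
            top -= r.rhsLen;           /* uncover the state beneath the RHS */
            stack[top + 1] = go_to(stack[top], r.lhs);
            top++;
            break;
        }
        case ACCEPT:
            return 1;                  /* parsed with no errors */
        case ERROR:
        default:
            printf("syntax error at input position %d\n", i);
            return 0;
        }
    }
}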

14 Example
Grammar:
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → (E)
6. F → id

Parse of id+id*id$

Stack            Input        Action
0                id+id*id$    S5
0id5             +id*id$      R6 (GOTO[0,F])
0F3              +id*id$      R4 (GOTO[0,T])
0T2              +id*id$      R2 (GOTO[0,E])
0E1              +id*id$      S6
0E1+6            id*id$       S5
0E1+6id5         *id$         R6 (GOTO[6,F])
0E1+6F3          *id$         R4 (GOTO[6,T])
0E1+6T9          *id$         S7
0E1+6T9*7        id$          S5
0E1+6T9*7id5     $            R6 (GOTO[7,F])
0E1+6T9*7F10     $            R3 (GOTO[6,T])
0E1+6T9          $            R1 (GOTO[0,E])
0E1              $            ACCEPT

Here's how this example works. The first item in the input is id and the state is 0. We push id onto the stack and switch to state 5. The next input is + in state 5, so we use R6 (reduce by rule 6 of the grammar). This rule says F → id, so we reduce the id on the stack to F. While changing the stack, the state to the left of id is 0, so we use row 0, column F to find our new state, 3. Now the stack holds 0F3. We did not shift anything onto the stack, so our input is still +, now in state 3. State 3 with input + gives R4. This grammar rule says T → F. The state to the left is still 0, so we use GOTO[0, T], which is state 2. We replace 0F3 with 0T2. We still have + as our input, now in state 2, so we use R2. This rule says E → T, and we go to GOTO[0, E], which is state 1. Our stack is now 0E1. With input + and state 1, we do S6 (shift, state 6). This pushes the input (+) onto the stack and switches us to state 6. Now the stack holds 0E1+6. The next input is id; under state 6 we get S5, which says to push the input onto the stack and switch to state 5. We now have 0E1+6id5. The next input is * in state 5. Here we again have R6 (reduce F → id), but now the state to the left is 6, so we use GOTO[6, F], which switches us to state 3. Our stack is now 0E1+6F3 with an input of *. We get R4 again to reduce F to T and switch to GOTO[6, T], or state 9. Now our stack is 0E1+6T9 with an input of *. Here we have S7, so we push *7 and switch to state 7, giving us 0E1+6T9*7. Our new input is id in state 7, or S5, so we push id5 and have 0E1+6T9*7id5. Our new input is $ in state 5, which is R6, so we reduce id to F and go to GOTO[7, F], or state 10; our stack is 0E1+6T9*7F10 with an input of $. Now we get R3, or T → T*F, so we reduce the T*F portion of the stack (that is, T9*7F10). To the left of this is 6, so we use GOTO[6, T], or state 9, and the stack is now 0E1+6T9. Our input is still $, so we use R1. This rule says E → E+T, so we reduce E1+6T9 to E, and we go to GOTO[0, E], or state 1. Now our stack is 0E1 with an input of $. We reach "accept", which means that we have accepted the input as legal. Our parse has applied, from top to bottom, grammar rules 1 (for id+…), 3 (for id*id), and 6/4/2 to convert each id.

