Lexical and Syntax Analysis

1 Lexical and Syntax Analysis
Chapter 4 Lexical and Syntax Analysis

2 Chapter 4 Topics
Introduction
Lexical Analysis
The Parsing Problem
Recursive-Descent Parsing
Bottom-Up Parsing
Copyright © 2015 Pearson. All rights reserved.

3 Introduction Language implementation systems must analyze source code, regardless of the specific implementation approach. Nearly all syntax analysis is based on a formal description of the syntax of the source language (BNF).

4 Syntax Analysis The syntax analysis portion of a language processor nearly always consists of two parts: A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar) A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

5 Advantages of Using BNF to Describe Syntax
Provides a clear and concise syntax description The parser can be based directly on the BNF Parsers based on BNF are easy to maintain

6 Reasons to Separate Lexical and Syntax Analysis
Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies the parser Efficiency - separation allows optimization of the lexical analyzer Portability - parts of the lexical analyzer may not be portable, but the parser always is portable

7 Lexical Analysis A lexical analyzer is a pattern matcher for character strings A lexical analyzer is a “front-end” for the parser Identifies substrings of the source program that belong together - lexemes Lexemes match a character pattern, which is associated with a lexical category called a token sum is a lexeme; its token may be IDENT

8 Lexical Analysis (continued)
The lexical analyzer is usually a function that is called by the parser when it needs the next token Three approaches to building a lexical analyzer: Write a formal description of the tokens and use a software tool that constructs a table-driven lexical analyzer from such a description Design a state diagram that describes the tokens and write a program that implements the state diagram Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram

9 State Diagram Design A naïve state diagram would have a transition from every state on every character in the source language - such a diagram would be very large!

10 Lexical Analysis (continued)
In many cases, transitions can be combined to simplify the state diagram When recognizing an identifier, all uppercase and lowercase letters are equivalent Use a character class that includes all letters When recognizing an integer literal, all digits are equivalent - use a digit class

11 Lexical Analysis (continued)
Reserved words and identifiers can be recognized together (rather than having a part of the diagram for each reserved word) Use a table lookup to determine whether a possible identifier is in fact a reserved word

12 Lexical Analysis (continued)
Convenient utility subprograms: getChar - gets the next character of input, puts it in nextChar, determines its class and puts the class in charClass; addChar - puts the character from nextChar into the place the lexeme is being accumulated, lexeme; lookup - determines whether the string in lexeme is a reserved word (returns a code)
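The three utilities can be sketched in C in the style of front.c; the character-class names and the token codes below are illustrative assumptions, not necessarily the book's exact values:

```c
/* A minimal sketch of the utility subprograms described above, in the
   style of Sebesta's front.c. Class names and token/keyword codes are
   illustrative assumptions. */
#include <ctype.h>
#include <string.h>

#define LETTER  0
#define DIGIT   1
#define UNKNOWN 99

static char lexeme[100];
static int  lexLen = 0;
static char nextChar;
static int  charClass;
static const char *input;   /* assumed: source text scanned by index */
static int  pos = 0;

/* getChar - get the next character of input, put it in nextChar,
   determine its class and put the class in charClass */
void getChar(void) {
    nextChar = input[pos++];
    if (isalpha((unsigned char)nextChar))      charClass = LETTER;
    else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
    else                                       charClass = UNKNOWN;
}

/* addChar - append nextChar to the lexeme being accumulated */
void addChar(void) {
    lexeme[lexLen++] = nextChar;
    lexeme[lexLen] = '\0';
}

/* lookup - return a reserved-word code if lexeme is a keyword,
   otherwise an identifier code (codes here are made up) */
int lookup(void) {
    static const char *reserved[] = { "if", "else", "while", "for" };
    for (int i = 0; i < 4; i++)
        if (strcmp(lexeme, reserved[i]) == 0)
            return 30 + i;   /* hypothetical keyword codes */
    return 11;               /* IDENT */
}

/* scan one identifier starting at s; returns its token code */
int scanIdent(const char *s) {
    input = s; pos = 0; lexLen = 0;
    getChar();
    while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
    }
    return lookup();
}
```

Note how the reserved-word table lookup (slide 11) folds naturally into the identifier path: the scanner never needs separate diagram states per keyword.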

13 State Diagram

14 Lexical Analyzer Implementation:  SHOW front.c (pp. 172-177)
- Following is the output of the lexical analyzer of front.c when used on (sum + 47) / total
Next token is: 25 Next lexeme is (
Next token is: 11 Next lexeme is sum
Next token is: 21 Next lexeme is +
Next token is: 10 Next lexeme is 47
Next token is: 26 Next lexeme is )
Next token is: 24 Next lexeme is /
Next token is: 11 Next lexeme is total
Next token is: -1 Next lexeme is EOF

15 Overview From: Chapter 3 of Compilers: Principles, Techniques and Tools, Aho, et al., Addison-Wesley
Basic Concepts & Regular Expressions: What does a Lexical Analyzer do? How does it Work?
Formalizing Token Definition & Recognition
Regular Expressions and Languages
Reviewing Finite Automata Concepts: Non-Deterministic and Deterministic FA
Conversion Process: Regular Expressions to NFA, NFA to DFA
Relating NFAs/DFAs/Conversion to Lexical Analysis - Tools: Lex/Flex/JFlex/ANTLR

16 Lexical Analyzer in Perspective
[Diagram: source program → lexical analyzer → token / get next token → parser; both consult the symbol table]
Important Issue: What are the Responsibilities of each Box? Focus on Lexical Analyzer and Parser

17 Lexical Analyzer in Perspective
LEXICAL ANALYZER: Scan Input; Remove WS, NL, …; Identify Tokens; Create Symbol Table; Insert Tokens into ST; Generate Errors; Send Tokens to Parser
PARSER: Perform Syntax Analysis; Actions Dictated by Token Order; Update Symbol Table Entries; Create Abstract Rep. of Source; Generate Errors; And More…. (We’ll see later)

18 Introducing Basic Terminology
What are the Major Terms for Lexical Analysis?
TOKEN : A classification for a common set of strings. Examples include Identifier, Integer, Float, Assign, LParen, RParen, etc.
PATTERN : The rules which characterize the set of strings for a token, e.g. integers: [0-9]+. Recall file and OS wildcards ([A-Z]*.*)
LEXEME : Actual sequence of characters that matches the pattern and is classified by a token. Identifiers: x, count, name, etc. Integers: 345, etc.

19 Introducing Basic Terminology
Token | Sample Lexemes | Informal Description of Pattern
const | const | const
if | if | if
relation | <, <=, =, <>, >, >= | < or <= or = or <> or > or >=
id | pi, count, D2 | letter followed by letters and digits
num | 3.1416, 0, 6.02E23 | any numeric constant
literal | “core dumped” | any characters between “ and “ except “
The token classifies the pattern. Actual values are critical; the info is: 1. Stored in symbol table 2. Returned to parser

20 Handling Lexical Errors
Error Handling is very localized, with Respect to Input Source For example: whil ( x := 0 ) do generates no lexical errors in PASCAL In what Situations do Errors Occur? Prefix of remaining input doesn’t match any defined token Possible error recovery actions: Deleting or Inserting Input Characters Replacing or Transposing Characters Or, skip over to next separator to “ignore” problem

21 How Does Lexical Analysis Work ?
Question is related to efficiency Where is potential performance bottleneck? 3 Techniques to Address Efficiency : Lexical Analyzer Generator Hand-Code / High Level Language Hand-Code / Assembly Language In Each Technique … Who handles efficiency ? How is it handled ?

22 Basic Scanning technique
Use 1 character of look-ahead Obtain char with getc() Do a case analysis Based on lookahead char Based on current lexeme Outcome If char can extend lexeme, all is well, go on. If char cannot extend lexeme: Figure out what the complete lexeme is and return its token Put the lookahead back into the symbol stream
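A minimal sketch of this case analysis, assuming a string cursor in place of getc()/ungetc() on a FILE* (the put-back logic is the same idea) and made-up token codes:

```c
/* Sketch of one-token scan with a single character of lookahead.
   A string cursor stands in for getc()/ungetc(); token codes are
   illustrative assumptions. */
#include <ctype.h>
#include <stdio.h>   /* for EOF */

enum { TOK_INT = 10, TOK_IDENT = 11, TOK_EOF = -1 };

static const char *src;
static int cur;

void scanInit(const char *s) { src = s; cur = 0; }

static int nextc(void)    { return src[cur] ? src[cur++] : EOF; }
static void putback(void) { cur--; }   /* like ungetc() */

/* scan the next token: each lexeme is extended while the lookahead
   character can extend it, then the lookahead is put back */
int scan(void) {
    int c = nextc();
    while (c == ' ') c = nextc();          /* skip blanks */
    if (c == EOF) return TOK_EOF;
    if (isdigit(c)) {
        while (isdigit(c = nextc())) ;     /* extend integer lexeme */
        if (c != EOF) putback();           /* char can't extend: put back */
        return TOK_INT;
    }
    if (isalpha(c)) {
        while (isalnum(c = nextc())) ;     /* extend identifier lexeme */
        if (c != EOF) putback();
        return TOK_IDENT;
    }
    return c;                              /* single-character token */
}
```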

23 Formalizing Token Definition
DEFINITIONS:
ALPHABET : Finite set of symbols, e.g. {0,1}, {a,b,c}, or {n,m, … , z}
STRING : Finite sequence of symbols from an alphabet, e.g. abbca or AABBC … A.K.A. word / sentence
If S is a string, then |S| is the length of S, i.e. the number of symbols in the string S.
ε : Empty String, with |ε| = 0

24 Formalizing Token Definition
EXAMPLES AND OTHER CONCEPTS: Suppose: S is the string banana Prefix : ban, banana Suffix : ana, banana Substring : nan, ban, ana, banana Subsequence: bnan, nn Proper prefix, suffix, or substring cannot be all of S

25 Language Concepts
A language, L, is simply any set of strings over a fixed alphabet.
Alphabet {0,1}: languages {0,10,100,1000,100000,…}, {0,1,00,11,000,111,…}
Alphabet {a,b,c}: language {abc,aabbcc,aaabbbccc,…}
Alphabet {A,…,Z}: languages {TEE,FORE,BALL,…}, {FOR,WHILE,GOTO,…}
Alphabet {A,…,Z,a,…,z,0,…,9,+,-,…,<,>,…}: { All legal PASCAL progs }, { All grammatically correct English sentences }
Special Languages: ∅ - the EMPTY LANGUAGE; {ε} - contains the ε string only

26 Formal Language Operations
DEFINITION
union of L and M, written L ∪ M: L ∪ M = {s | s is in L or s is in M}
concatenation of L and M, written LM: LM = {st | s is in L and t is in M}
Kleene closure of L, written L*: L* denotes “zero or more concatenations of” L
positive closure of L, written L+: L+ denotes “one or more concatenations of” L

27 Formal Language Operations Examples
L = {A, B, C, D }  D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3 }
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
L2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}
L4 = L2 L2 = ??
L* = { All possible strings of L plus ε }
L+ = L* - {ε}
L (L ∪ D ) = ?? = {A, B, C, D } ({A, B, C, D } ∪ {1, 2, 3})
L (L ∪ D )* = ??
L ∪ (L (L ∪ D ))
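The LD example can be checked mechanically; this small C sketch (the helper name concatSets is hypothetical) builds every string st with s in L and t in D:

```c
/* Tiny demonstration of the concatenation LD for L = {A,B,C,D} and
   D = {1,2,3}: every string st with s in L and t in D. */
#include <string.h>

/* build LD into out (each entry 2 chars + '\0'); returns the count */
int concatSets(const char *L, const char *D, char out[][3]) {
    int n = 0;
    for (size_t i = 0; i < strlen(L); i++)
        for (size_t j = 0; j < strlen(D); j++) {
            out[n][0] = L[i];    /* first symbol from L  */
            out[n][1] = D[j];    /* second symbol from D */
            out[n][2] = '\0';
            n++;
        }
    return n;   /* |LD| = |L| * |D| = 4 * 3 = 12 */
}
```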

28 Language & Regular Expressions
A Regular Expression is a Set of Rules / Techniques for Constructing Sequences of Symbols (Strings) From an Alphabet. Let Σ be an Alphabet, r a Regular Expression. Then L(r) is the Language that is Characterized by the Rules of r

29 Rules for Specifying Regular Expressions:
ε is a regular expression denoting {ε}
If a is in Σ, a is a regular expression that denotes {a}
Let r and s be regular expressions with languages L(r) and L(s). Then
(a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
(b) (r)(s) is a regular expression denoting L(r) L(s)
(c) (r)* is a regular expression denoting (L(r))*
(d) (r) is a regular expression denoting L(r)
All are left-associative. Precedence (highest to lowest): *, then concatenation, then |

30 EXAMPLES of Regular Expressions
L = {A, B, C, D }  D = {1, 2, 3}
A | B | C | D = L
(A | B | C | D ) (A | B | C | D ) = L2
(A | B | C | D )* = L*
(A | B | C | D ) ((A | B | C | D ) | ( 1 | 2 | 3 )) = L (L ∪ D)

31 Algebraic Properties of Regular Expressions
AXIOM | DESCRIPTION
r | s = s | r  :  | is commutative
r | (s | t) = (r | s) | t  :  | is associative
(r s) t = r (s t)  :  concatenation is associative
r ( s | t ) = r s | r t  and  ( s | t ) r = s r | t r  :  concatenation distributes over |
εr = r  and  rε = r  :  ε is the identity element for concatenation
r* = ( r | ε )*  :  relation between * and ε
r** = r*  :  * is idempotent

32 Towards Token Definition
Regular Definitions: Associate names with Regular Expressions
For Example : PASCAL IDs
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Shorthand Notation:
“+” : one or more; r* = r+ | ε and r+ = r r*
“?” : zero or one
[range] : set range of characters (replaces “|”); [A-Z] = A | B | C | … | Z
Using Shorthand : PASCAL IDs: id → [A-Za-z][A-Za-z0-9]*
We’ll Use Both Techniques
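On POSIX systems the shorthand form can be tested directly with the regcomp/regexec API; the anchors ^ and $ (an addition, not part of the slide's pattern) force a whole-string match:

```c
/* Checking the shorthand id pattern [A-Za-z][A-Za-z0-9]* with POSIX
   regular expressions (regcomp/regexec). The ^...$ anchors are added
   so the whole string must match, not just a substring. */
#include <regex.h>
#include <stddef.h>

/* returns 1 if s is a legal PASCAL-style identifier, else 0 */
int isPascalId(const char *s) {
    regex_t re;
    int ok;
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                          /* pattern failed to compile */
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```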

33 Token Recognition Tokens as Patterns
How can we use concepts developed so far to assist in recognizing tokens of a source language ?
Assume the following tokens: if, then, else, relop, id, num. What language construct are they used for ?
Given Tokens, What are Patterns ?
if → if
then → then
else → else
relop → < | <= | > | >= | = | <>
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
What does this represent ?
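The num pattern can also be hand-coded as a sequence of checks; in this sketch the names isNum and digits are hypothetical, and the function returns 1 only when the whole string matches digit+ (. digit+)? ( E (+|-)? digit+ )?:

```c
/* Direct hand-coding of num -> digit+ (. digit+)? ( E (+|-)? digit+ )?.
   Each helper consumes exactly the sub-pattern named in its comment. */
#include <ctype.h>

static int digits(const char **p) {      /* digit+ : at least one digit */
    if (!isdigit((unsigned char)**p)) return 0;
    while (isdigit((unsigned char)**p)) (*p)++;
    return 1;
}

int isNum(const char *s) {
    if (!digits(&s)) return 0;           /* digit+                 */
    if (*s == '.') {                     /* ( . digit+ )?          */
        s++;
        if (!digits(&s)) return 0;
    }
    if (*s == 'E') {                     /* ( E (+|-)? digit+ )?   */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!digits(&s)) return 0;
    }
    return *s == '\0';                   /* whole string consumed  */
}
```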

34 What Else Does Lexical Analyzer Do? Throw Away Tokens
Fact: Some token classes are defined only to be discarded. Example: in C, whitespace, tabulations, carriage returns, and comments can be discarded without affecting the program’s meaning.
blank → the blank character
tab → ^T
newline → ^M
delim → blank | tab | newline
ws → delim+

35 Overall
Regular Expression | Token | Attribute-Value
ws | - | -
if | if | -
then | then | -
else | else | -
id | id | pointer to table entry
num | num | pointer to table entry
< | relop | LT
<= | relop | LE
= | relop | EQ
<> | relop | NE
> | relop | GT
>= | relop | GE
Note: Each token has a unique token identifier to define category of lexemes

36 Regular Expression Examples
All Strings of Upper Case Characters That Contain Five Vowels in Order:
cons → B | C | D | F | … | Z
string → cons* A cons* E cons* I cons* O cons* U cons*
All Strings of Lower Case Characters that start with “tab” or end with “bat”:
lowers → a | b | … | z
string → ( tab lowers* ) | ( lowers* bat )

37 Regular Expression Examples
All Strings of {1,2,3} where {1,2,3} exist in descending order: string → 3* 2* 1*
All Strings of {0,1, …, 9} in which Digits are in Ascending Numerical Order: string → 0* 1* 2* 3* 4* 5* 6* 7* 8* 9*

38 Tokens / Patterns / Regular Expressions
Lexical Analysis - searches for matches of lexeme to pattern
Lexical Analyzer returns: <actual lexeme, symbolic identifier of token>
For Example, tokens with symbolic IDs: if, then, else, >, >=, <, …, :=, id, int, real
Set of all regular expressions plus symbolic ids plus analyzer define required functionality.
REs --alg--> NFA --alg--> DFA (program for simulation)

39 Automata & Language Theory
Terminology FSA A recognizer that takes an input string and determines whether it’s a valid string of the language. Non-Deterministic FSA (NFA) Has several alternative actions for the same input symbol Deterministic FSA (DFA) Has at most 1 action for any given input symbol Bottom Line expressive power(NFA) == expressive power(DFA) Conversion can be automated

40 Finite Automata & Language Theory
A recognizer that takes an input string & determines whether it’s a valid sentence of the language Non-Deterministic : Has more than one alternative action for the same input symbol Can’t utilize algorithm ! Deterministic : Has at most one action for a given input symbol. Both types are used to recognize regular expressions.

41 NFAs & DFAs Non-Deterministic Finite Automata (NFAs) easily represent regular expressions, but are somewhat less precise. Deterministic Finite Automata (DFAs) require more complexity to represent regular expressions, but offer more precision. We’ll discuss both plus conversion algorithms, i.e., NFA → DFA and DFA → NFA

42 Non-Deterministic Finite Automata
An NFA is a mathematical model that consists of:
S, a set of states
Σ, the symbols of the input alphabet
move, a transition function: move(state, symbol) → set of states, i.e. move : S × Σ → P(S)
s0 ∈ S, the start state
F ⊆ S, a set of final or accepting states.

43 What Language is defined ? What is the Transition Table ?
Example NFA [diagram omitted]
S = { 0, 1, 2, 3 }, s0 = 0, F = { 3 }, Σ = { a, b }
What Language is defined ? ( a | b )* a b b
What is the Transition Table ?
state | a | b
0 | { 0, 1 } | { 0 }
1 | -- | { 2 }
2 | -- | { 3 }
ε (null) moves are also possible: switch from state i to state j without using any input symbol

44 Epsilon-Transitions Given the regular expression: (a (b*c)) | (a (b |c+)?) Find a transition diagram NFA that recognizes it. Solution ?

45 NFA- Regular Expressions & Compilation
Problems with NFAs for Regular Expressions: 1. Valid input might not be accepted 2. NFA may behave differently on the same input Relationship of NFAs to Compilation: 1. Regular expression “recognized” by NFA 2. Regular expression is “pattern” for a “token” 3. Tokens are building blocks for lexical analysis 4. Lexical analyzer can be described by a collection of NFAs. Each NFA is for a language token.

46 The DFA Saves the Day
A DFA is an NFA with a few restrictions:
No epsilon transitions
For every state s, there is only one transition δ(s,x) from s for any symbol x in Σ
Corollaries: Easy to implement a DFA with an algorithm! Deterministic behavior
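Because a DFA has exactly one move per (state, symbol) pair, simulation is a plain table walk. The table below is a hand-built DFA for (a|b)*abb, included only as an illustration:

```c
/* Table-driven simulation of a DFA for (a|b)*abb. One row per state,
   one column per alphabet symbol; simulation is a simple table walk. */
static const int dfa[4][2] = {
    /*        a  b       state meaning              */
    /* 0 */ { 1, 0 },  /* no useful suffix seen     */
    /* 1 */ { 1, 2 },  /* seen ...a                 */
    /* 2 */ { 1, 3 },  /* seen ...ab                */
    /* 3 */ { 1, 0 },  /* seen ...abb  (accepting)  */
};

/* returns 1 if s (over {a,b}) ends in abb, else 0 */
int dfaAccepts(const char *s) {
    int state = 0;
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;  /* outside alphabet */
        state = dfa[state][*s == 'b'];         /* one move per symbol */
    }
    return state == 3;
}
```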

47 NFA vs. DFA
NFA: smaller number of states Qnfa; simulating it requires O(|Qnfa|) computation for each input symbol.
DFA: larger number of states Qdfa; simulating it requires constant computation for each input symbol.
Caveat - the generic NFA => DFA construction gives |Qdfa| ~ 2^|Qnfa|; but DFA’s are perfectly optimizable! (i.e., you can find the smallest possible Qdfa)

48 One catch... NFA-DFA comparison

49 NFA to DFA Conversion – Main Idea
Look at the states reachable without consuming any input, and aggregate them into macro states
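The "aggregate" step is the epsilon-closure. With at most 32 NFA states, a macro state fits in one unsigned bitmask; the epsilon edges below are a made-up example:

```c
/* Epsilon-closure with bitmask macro states: all NFA states reachable
   from a given set through epsilon edges alone. The edges here are a
   made-up 4-state example. */
#define NSTATES 4

/* eps[i] = bitmask of states reachable from i by ONE epsilon edge */
static const unsigned eps[NSTATES] = {
    1u << 1,   /* 0 --eps--> 1        */
    1u << 2,   /* 1 --eps--> 2        */
    0,         /* 2: no epsilon edges */
    0,         /* 3: no epsilon edges */
};

/* closure of a set of states (as a bitmask): keep adding epsilon
   successors until nothing new appears */
unsigned epsClosure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int i = 0; i < NSTATES; i++)
            if (set & (1u << i))
                set |= eps[i];     /* add direct epsilon successors */
    } while (set != prev);         /* stop at the fixed point */
    return set;
}
```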

50 Final Result A DFA state is final if one of the NFA states it contains was final

51 Pulling Together Concepts
Designing Lexical Analyzer Generator: Reg. Expr. → NFA construction; NFA → DFA conversion; DFA simulation for lexical analyzer
Recall Lex Structure: Pattern / Action pairs. Each pattern recognizes lexemes; each pattern is described by a regular expression, e.g. (a | b)*abb, (abc)*ab, etc. The regular expression is the Recognizer!

52 Lex Specification → Lexical Analyzer
Let P1, P2, … , Pn be Lex patterns (regular expressions for valid tokens in prog. lang.) Construct N(P1), N(P2), … N(Pn) What’s true about list of Lex patterns ? Construct NFA: N(P1) Lex applies conversion algorithm to construct DFA that is equivalent! N(P2) N(Pn)

53 Pictorially Lex Specification Lex Compiler Transition Table
(a) Lex Specification → Lex Compiler → Transition Table
(b) Schematic lexical analyzer: input buffer → FA Simulator (produces lexeme), driven by the Transition Table

54 Example Let the 3 patterns be: a, abb, a*b+. [NFA diagrams omitted]

55 Can you do this conversion ???
Example – continued(1) Combined NFA [diagram omitted]. Construct DFA : (It has 6 states) {0,1,3,7}, {2,4,7}, {5,8}, {6,8}, {7}, {8} Can you do this conversion ???

56 Example – continued(2) Dtran for this example:
STATE | a | b | Pattern
{0,1,3,7} | {2,4,7} | {8} | none
{2,4,7} | {7} | {5,8} | a
{8} | - | {8} | a*b+
{7} | {7} | {8} | none
{5,8} | - | {6,8} | a*b+
{6,8} | - | {8} | abb

57 Morale? All of this can be automated with a tool!
LEX - the first lexical analyzer tool, generating C
FLEX - a newer/faster implementation, C / C++ friendly
JFLEX - a lexer generator for Java, based on the same principles
JavaCC - a scanner and parser generator for Java

58 The Parsing Problem Goals of the parser, given an input program:
Find all syntax errors; for each, produce an appropriate diagnostic message and recover quickly Produce the parse tree, or at least a trace of the parse tree, for the program

59 The Parsing Problem (continued)
Two categories of parsers Top down - produce the parse tree, beginning at the root Order is that of a leftmost derivation Traces or builds the parse tree in preorder Bottom up - produce the parse tree, beginning at the leaves Order is that of the reverse of a rightmost derivation Useful parsers look only one token ahead in the input

60 The Parsing Problem (continued)
Top-down Parsers: Given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A. The most common top-down parsing algorithms: Recursive descent - a coded implementation; LL parsers - table-driven implementation

61 The Parsing Problem (continued)
Bottom-up parsers: Given a right sentential form, α, determine what substring of α is the right-hand side of the rule in the grammar that must be reduced to produce the previous sentential form in the rightmost derivation. The most common bottom-up parsing algorithms are in the LR family

62 The Parsing Problem (continued)
The Complexity of Parsing Parsers that work for any unambiguous grammar are complex and inefficient ( O(n3), where n is the length of the input ) Compilers use parsers that only work for a subset of all unambiguous grammars, but do it in linear time ( O(n), where n is the length of the input )

63 Recursive-Descent Parsing
There is a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal EBNF is ideally suited for being the basis for a recursive-descent parser, because EBNF minimizes the number of nonterminals

64 Recursive-Descent Parsing (continued)
A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | int_constant | ( <expr> )

65 Recursive-Descent Parsing (continued)
Assume we have a lexical analyzer named lex, which puts the next token code in nextToken The coding process when there is only one RHS: For each terminal symbol in the RHS, compare it with the next input token; if they match, continue, else there is an error For each nonterminal symbol in the RHS, call its associated parsing subprogram

66 Recursive-Descent Parsing (continued)
/* Function expr
   Parses strings in the language generated by the rule:
   <expr> → <term> {(+ | -) <term>} */
void expr() {
  /* Parse the first term */
  term();
  /* As long as the next token is + or -, call lex to get the
     next token and parse the next term */
  while (nextToken == ADD_OP || nextToken == SUB_OP) {
    lex();
    term();
  }
}

67 Recursive-Descent Parsing (continued)
This particular routine does not detect errors Convention: Every parsing routine leaves the next token in nextToken

68 Recursive-Descent Parsing (continued)
A nonterminal that has more than one RHS requires an initial process to determine which RHS it is to parse The correct RHS is chosen on the basis of the next token of input (the lookahead) The next token is compared with the first token that can be generated by each RHS until a match is found If no match is found, it is a syntax error

69 Recursive-Descent Parsing (continued)
/* Function term
   Parses strings in the language generated by the rule:
   <term> -> <factor> {(* | /) <factor>} */
void term() {
  /* Parse the first factor */
  factor();
  /* As long as the next token is * or /, call lex to get the
     next token and parse the next factor */
  while (nextToken == MULT_OP || nextToken == DIV_OP) {
    lex();
    factor();
  }
} /* End of function term */

70 Recursive-Descent Parsing (continued)
/* Function factor
   Parses strings in the language generated by the rule:
   <factor> -> id | int_constant | ( <expr> ) */
void factor() {
  /* Determine which RHS */
  if (nextToken == ID_CODE || nextToken == INT_CODE)
    /* For the RHS id or int_constant, just call lex */
    lex();
  /* If the RHS is ( <expr> ), call lex to pass over the left
     parenthesis, call expr, and check for the right parenthesis */
  else if (nextToken == LP_CODE) {
    lex();
    expr();
    if (nextToken == RP_CODE)
      lex();
    else
      error();
  } /* End of else if (nextToken == ... */
  else error(); /* Neither RHS matches */
}

71 Recursive-Descent Parsing (continued)
- Trace of the lexical and syntax analyzers on (sum + 47) / total
Next token is: 25 Next lexeme is (
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 11 Next lexeme is sum
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 21 Next lexeme is +
Exit <factor>
Exit <term>
Next token is: 10 Next lexeme is 47
Enter <term>
Enter <factor>
Next token is: 26 Next lexeme is )
Exit <factor>
Exit <term>
Exit <expr>
Next token is: 24 Next lexeme is /
Exit <factor>
Next token is: 11 Next lexeme is total
Enter <factor>
Next token is: -1 Next lexeme is EOF
Exit <factor>
Exit <term>
Exit <expr>

72 Recursive-Descent Parsing (continued)
The LL Grammar Class - The Left Recursion Problem
If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser.
A grammar can be modified to remove direct left recursion as follows:
1. For each nonterminal A, group the A-rules as A → Aα1 | … | Aαm | β1 | β2 | … | βn, where none of the β‘s begins with A
2. Replace the original A-rules with
A → β1A’ | β2A’ | … | βnA’
A’ → α1A’ | α2A’ | … | αmA’ | ε
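As a worked instance of the rule above, consider the classic directly left-recursive expression rule (a standard textbook example, not taken from this chapter's grammar):

```text
Original:   E → E + T | T          (one α: α1 = + T;  one β: β1 = T)

Replaced:   E  → T E'
            E' → + T E' | ε
```

The new rules generate the same language, but every E-derivation now begins with T, so a top-down parser can expand E without looping.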

73 Recursive-Descent Parsing (continued)
The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness - the inability to determine the correct RHS on the basis of one token of lookahead
Def: FIRST(α) = {a | α =>* aβ} (If α =>* ε, ε is in FIRST(α))

74 Recursive-Descent Parsing (continued)
Pairwise Disjointness Test: For each nonterminal A in the grammar that has more than one RHS, and for each pair of rules, A → αi and A → αj, it must be true that FIRST(αi) ∩ FIRST(αj) = ∅
Example:
A → a | bB | cAb (passes: the FIRST sets {a}, {b}, {c} are disjoint)
A → a | aB (fails: both RHSs begin with a)

75 Recursive-Descent Parsing (continued)
Left factoring can resolve the problem. Replace
<variable> → identifier | identifier [<expression>]
with
<variable> → identifier <new>
<new> → ε | [<expression>]
or
<variable> → identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)

76 Bottom-up Parsing The parsing problem is finding the correct RHS in a right-sentential form to reduce to get the previous right-sentential form in the derivation

77 Bottom-up Parsing (continued)
Intuition about handles:
Def: β is the handle of the right sentential form γ = αβw if and only if S =>*rm αAw =>rm αβw
Def: β is a phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 =>+ α1βα2
Def: β is a simple phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 => α1βα2

78 Bottom-up Parsing (continued)
Intuition about handles (continued): The handle of a right sentential form is its leftmost simple phrase Given a parse tree, it is now easy to find the handle Parsing can be thought of as handle pruning

79 Bottom-up Parsing (continued)
Shift-Reduce Algorithms Reduce is the action of replacing the handle on the top of the parse stack with its corresponding LHS Shift is the action of moving the next token to the top of the parse stack

80 Bottom-up Parsing (continued)
Advantages of LR parsers: They will work for nearly all grammars that describe programming languages. They work on a larger class of grammars than other bottom-up algorithms, but are as efficient as any other bottom-up parser. They can detect syntax errors as soon as it is possible. The LR class of grammars is a superset of the class parsable by LL parsers.

81 Bottom-up Parsing (continued)
LR parsers must be constructed with a tool Knuth’s insight: A bottom-up parser could use the entire history of the parse, up to the current point, to make parsing decisions There are only a finite and relatively small number of different parse situations that could have occurred, so the history could be stored in a parser state, on the parse stack

82 Bottom-up Parsing (continued)
An LR configuration stores the state of an LR parser (S0X1S1X2S2…XmSm, aiai+1…an$)

83 Bottom-up Parsing (continued)
LR parsers are table driven, where the table has two components, an ACTION table and a GOTO table The ACTION table specifies the action of the parser, given the parser state and the next token Rows are state names; columns are terminals The GOTO table specifies which state to put on top of the parse stack after a reduction action is done Rows are state names; columns are nonterminals

84 Structure of An LR Parser

85 Bottom-up Parsing (continued)
Initial configuration: (S0, a1…an$) Parser actions: For a Shift, the next symbol of input is pushed onto the stack, along with the state symbol that is part of the Shift specification in the Action table For a Reduce, remove the handle from the stack, along with its state symbols. Push the LHS of the rule. Push the state symbol from the GOTO table, using the state symbol just below the new LHS in the stack and the LHS of the new rule as the row and column into the GOTO table
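These actions can be seen in a minimal shift-reduce driver. The grammar here is the toy S -> ( S ) | x with hand-built ACTION/GOTO tables; this is an illustration under that assumption, not the chapter's example grammar, and real tables come from a generator such as yacc or bison:

```c
/* Minimal LR driver for the toy grammar  S -> ( S ) | x.
   ACTION/GOTO tables were built by hand for illustration. */
enum { ERR = 0, SH = 1, RE = 2, ACC = 3 };     /* action kinds */
typedef struct { int kind; int arg; } Act;

/* terminal column: 0='(', 1='x', 2=')', 3='$' */
static int termIndex(char c) {
    switch (c) { case '(': return 0; case 'x': return 1;
                 case ')': return 2; default:  return 3; }
}

/* rule 1: S -> x (RHS length 1), rule 2: S -> ( S ) (RHS length 3) */
static const Act ACTION[6][4] = {
    /*  (        x        )        $       */
    { {SH,2}, {SH,3}, {ERR,0}, {ERR,0} },   /* state 0 */
    { {ERR,0},{ERR,0},{ERR,0}, {ACC,0} },   /* state 1: accept on $  */
    { {SH,2}, {SH,3}, {ERR,0}, {ERR,0} },   /* state 2 */
    { {ERR,0},{ERR,0},{RE,1},  {RE,1}  },   /* state 3: reduce S->x  */
    { {ERR,0},{ERR,0},{SH,5},  {ERR,0} },   /* state 4 */
    { {ERR,0},{ERR,0},{RE,2},  {RE,2}  },   /* state 5: reduce S->(S)*/
};
static const int GOTO_S[6] = { 1, -1, 4, -1, -1, -1 }; /* GOTO[state][S] */
static const int POPS[3]   = { 0, 1, 3 };   /* RHS length per rule */

/* returns 1 if input (ending in '$') is accepted */
int lrParse(const char *input) {
    int stack[64], top = 0, i = 0;
    stack[0] = 0;                              /* start state S0 */
    for (;;) {
        Act a = ACTION[stack[top]][termIndex(input[i])];
        if (a.kind == SH) {                    /* shift: push state, advance */
            stack[++top] = a.arg;
            i++;
        } else if (a.kind == RE) {             /* reduce: pop handle's states */
            int g;
            top -= POPS[a.arg];
            g = GOTO_S[stack[top]];            /* GOTO[exposed state][S] */
            if (g < 0) return 0;
            stack[++top] = g;                  /* push the GOTO state */
        } else
            return a.kind == ACC;              /* accept or error */
    }
}
```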

86 Bottom-up Parsing (continued)
Parser actions (continued): For an Accept, the parse is complete and no errors were found. For an Error, the parser calls an error-handling routine.

87 LR Parsing Table

88 Bottom-up Parsing (continued)
A parser table can be generated from a given grammar with a tool, e.g., yacc or bison

89 Parsing From: Chapter 3 of Compilers: Principles, Techniques and Tools, Aho, et al., Addison-Wesley Prof. Steven A. Demurjian Computer Science & Engineering Department The University of Connecticut 371 Fairfield Way, Unit 2155 Storrs, CT (860)

90 Basic Intuition
Use a sliding window over the input for lookahead. Benefit: reveal the input slowly, a little bit at a time - 1 token at a time, maybe 2 at a time? - but systematically, left to right. What kind of derivation can use tokens in this way?

91 Overall Top-Down Algorithm
Data structures: A stack that holds the symbols to treat; a lookahead window to choose a prediction
Algorithm:
Startup: Initialize stack to start symbol (a non-terminal); initialize lookahead window at start of token stream
Recursive Process: Find out if front of window and top of stack match. If match: consume the symbols. No match: Pop / Select / Push

92 How Could Algorithm Work?
Grammar: E → TE’  E’ → +TE’ | ε  T → id
Input : id + id $
Derivation: E ⇒ TE’ ⇒ idE’ ⇒ id+TE’ ⇒ id+idE’ ⇒ id+id
Processing Stack contains the derivation plus $:
E $ | T E’ $ | id E’ $ | E’ $ | + T E’ $ | T E’ $ | id E’ $ | E’ $ | $

93 The Lookahead Window How large ? Very good question.
The art of counting 1,2, a lot. Conclusion Usually, 1 token is plenty Occasionally, 2 tokens may be needed Beyond 2 is wild In theory Any finite constant will do 1,2,3....,k

94 Lookahead Size and Languages
With k tokens of lookahead Some languages can be parsed (this way) Some languages cannot be parsed What about k+1 tokens ? More languages can be parsed.... So the set of languages recognized with k is a subset of the set of languages recognized with k+1! We have a hierarchy! Still... In practice LL(1) should be enough.

95 LL(1) Grammars for Top-Down Parsing
L : Scan input from Left to Right
L : Construct a Leftmost Derivation
1 : Use “1” input symbol as lookahead in conjunction with stack to decide on the parsing action
LL(1) grammars have no multiply-defined entries in the parsing table.
Properties of LL(1) grammars: the grammar can’t be ambiguous or left recursive. A grammar is LL(1) when, for each pair of rules A → α | β:
1. α and β do not derive strings starting with the same terminal a
2. At most one of α and β can derive ε
Note: It may not be possible for a grammar to be manipulated into an LL(1) grammar

96 Top-Down Parsing Identify a leftmost derivation for an input string
Why ? By always replacing the leftmost non-terminal symbol via a production rule, we are guaranteed of developing a parse tree in a left-to-right fashion consistent with scanning the input.
A ⇒ aBc ⇒ adDc ⇒ adec (scan a, scan d, scan e, scan c - accept!)
Recursive-descent parsing concepts: Predictive parsing; Recursive / Brute-force technique; non-recursive / table-driven; Error recovery; Implementation

97 Recursive Descent Parsing Concepts
General category of Parsing: Top-Down. Choose the production rule based on the input symbol; may require backtracking to correct a wrong choice.
Example: S → c A d, A → ab | a, input: cad
[parse-tree snapshots omitted] Problem: expanding A → ab first fails on input cad (b does not match d), so the parser must backtrack and try A → a

98 Predictive Parsing : Recursive
To eliminate backtracking, grammar must have: no left recursion; left factoring applied; ε-moves removed. If so, we can utilize the current input symbol in conjunction with the non-terminals to be expanded to uniquely determine the next action. Utilize transition diagrams (TDs). For each non-terminal of the grammar: Create an initial and final state; if A → X1X2…Xn is a production, add a path with edges X1, X2, … , Xn. TDs can be algorithmized into a Program

99 Transition Diagrams (TDs)
Unlike their lexical equivalents, each edge represents a token. Transition implies: if the token matches, choose that edge; on a non-terminal edge, call its procedure. Recall the earlier grammar and its associated TDs:
E → TE’   E’ → +TE’ | ε   T → FT’   T’ → *FT’ | ε   F → ( E ) | id
[transition diagrams for E, E’, T, T’, and F not reproduced: each is a chain of numbered states connected by the symbols of the production’s right-hand side]
How are transition diagrams used? Are ε-moves a problem? Can we simplify transition diagrams? Why is simplification critical?

100 Motivating Table-Driven Parsing
1. Left-to-right scan of the input. 2. Find a leftmost derivation.
Grammar: E → TE’, E’ → +TE’ | ε, T → id    Input: id + id $ ($ is the terminator)
Derivation: E ⇒ TE’ ⇒ idE’ ⇒ id + TE’ ⇒ id + idE’ ⇒ id + id
Processing stack (contains the rest of the derivation plus $):
E $ → T E’ $ → id E’ $ → E’ $ → + T E’ $ → T E’ $ → id E’ $ → E’ $ → $

101 Non-Recursive / Table Driven
[figure: input buffer (string + terminator $), a stack of grammar symbols, the predictive parsing program, the output, and the parsing table M[A,a]]
The parsing table, indexed by non-terminals and terminal symbols of the CFG, tells the parser what action to take based on the stack and input; $ is the empty-stack symbol.
General parser behavior (X: top of stack, a: current input):
1. When X = a = $: halt, accept, success
2. When X = a ≠ $: pop X off the stack, advance the input, go to 1
3. When X is a non-terminal: examine M[X,a]; if it is an error, call the recovery routine; if M[X,a] = {X → UVW}, pop X and push W, V, U in reverse order. Do NOT expend any input.

102 Algorithm for Non-Recursive Parsing
Set ip to point to the first symbol of w$;
repeat
  let X be the top stack symbol and a the symbol pointed to by ip;
  if X is a terminal or $ then
    if X = a then pop X from the stack and advance ip
    else error()
  else /* X is a non-terminal */
    if M[X,a] = X → Y1Y2…Yk then begin
      pop X from the stack;
      push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top;
      output the production X → Y1Y2…Yk
    end
until X = $ /* stack is empty */
(ip is the input pointer; the parser may also execute other code based on the production used)

103 Example Our well-worn example! Table M
E → TE’   E’ → +TE’ | ε   T → FT’   T’ → *FT’ | ε   F → ( E ) | id
Table M (rows: non-terminals; columns: input symbols):
Non-terminal | id     | +        | *        | (      | )     | $
E            | E→TE’  |          |          | E→TE’  |       |
E’           |        | E’→+TE’  |          |        | E’→ε  | E’→ε
T            | T→FT’  |          |          | T→FT’  |       |
T’           |        | T’→ε     | T’→*FT’  |        | T’→ε  | T’→ε
F            | F→id   |          |          | F→(E)  |       |

104 Trace of Example
Stack     | Input          | Output
$E        | id + id * id$  |
$E’T      | id + id * id$  | E → TE’
$E’T’F    | id + id * id$  | T → FT’
$E’T’id   | id + id * id$  | F → id
$E’T’     | + id * id$     | (match id)
$E’       | + id * id$     | T’ → ε
$E’T+     | + id * id$     | E’ → +TE’
$E’T      | id * id$       | (match +)
$E’T’F    | id * id$       | T → FT’
$E’T’id   | id * id$       | F → id
$E’T’     | * id$          | (match id)
$E’T’F*   | * id$          | T’ → *FT’
$E’T’F    | id$            | (match *)
$E’T’id   | id$            | F → id
$E’T’     | $              | (match id)
$E’       | $              | T’ → ε
$         | $              | E’ → ε
The input is expended as the derivation unfolds.

105 Leftmost Derivation for the Example
The leftmost derivation for the example is as follows:
E ⇒ TE’ ⇒ FT’E’ ⇒ id T’E’ ⇒ id E’ ⇒ id + TE’ ⇒ id + FT’E’ ⇒ id + id T’E’ ⇒ id + id * FT’E’ ⇒ id + id * id T’E’ ⇒ id + id * id E’ ⇒ id + id * id

106 Bottom-Up Parsing Basic Intuition and Motivation
Recall that LL(k) works TOP-DOWN With a LEFTMOST Derivation Predicts the right production to select based on lookahead Our new motto LR(k) works BOTTOM-UP With a RIGHTMOST Derivation in REVERSE Commits to the production choice after seeing the whole body (right hand side of rule), working in “reverse”

107 Bottom-Up Parsing Inverse or Complement of Top-Down Parsing
Top-Down Parsing utilizes the start symbol and attempts to derive the input string using productions. Bottom-Up Parsing makes reductions to the input string until it reduces to the start symbol.
For example, consider the grammar: S → aABe, A → Abc | b, B → d
What does each derivation represent?
Top-Down: a leftmost derivation: S ⇒ aABe ⇒ aAbcBe ⇒ abbcBe ⇒ abbcde
Bottom-Up: a rightmost derivation in reverse! abbcde ⇒ aAbcde ⇒ aAde ⇒ aABe ⇒ S

108 Bottom-Up Parsing Top-Down Derivation Bottom-Up in Reverse
Top-down derivation: S ⇒ aABe ⇒ aAbcBe ⇒ abbcBe ⇒ abbcde (rules applied: S → aABe, A → Abc, A → b, B → d)
Bottom-up in reverse: abbcde ⇒ aAbcde ⇒ aAde ⇒ aABe ⇒ S (rules applied: A → b, A → Abc, B → d, S → aABe)

109 Type of Derivation
Grammar: S → aABe, A → Abc | b, B → d
Key Issues: How do we determine which substring to “reduce”? How do we know which production rule to use? What is the general processing for BUP? How are conflicts resolved? What types of BUP are considered?
TDP: S ⇒ aABe ⇒ aAbcBe ⇒ abbcBe ⇒ abbcde
BUP: S ⇐ aABe ⇐ aAde ⇐ aAbcde ⇐ abbcde

110 Illustrating BUP
Grammar: S → aABe, A → Abc | b, B → d
Rules are processed in reverse: A → b, A → Abc, B → d, S → aABe
BUP steps: abbcde ⇒ aAbcde ⇒ aAde ⇒ aABe ⇒ S

111 Identify Right Hand Side of Rule
Consider again: S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde
Grammar: S → aABe, A → Abc | b, B → d

112 Identify Right Hand Side of Rule
What bottom-up really means: abbcde ⇒ aAbcde (reduce b to A using A → b)

113 Identify Right Hand Side of Rule
aAbcde ⇒ aAde (reduce Abc to A using A → Abc)

114 Identify Right Hand Side of Rule
aAde ⇒ aABe (reduce d to B using B → d)

115 Identify Right Hand Side of Rule
aABe ⇒ S (reduce aABe to S using S → aABe)

116 Bottom-Up Parsing
Recognize the body (RHS of rule) of the last production applied in the rightmost derivation. Replace that symbol sequence by the LHS of the production rule, based on the “current” input. Repeat. At the end, either we are left with the start symbol (success!) or we get “stuck” somewhere (syntax error!).
Key Issue: if there are multiple handles for the “same” sentential form, then the grammar G is ambiguous.

117 General Processing of BUP
Basic mechanisms: “Shift” and “Reduce”. Basic data structure: a stack of grammar symbols (terminals and non-terminals).
Basic idea: shift input symbols onto the stack until the entire handle (RHS of rule) of the last rightmost reduction is on the stack. When the body of the last RM reduction is on the stack, reduce it by replacing the body with the left-hand side of the production rule. When only the start symbol is left, we are done.

118 Grammar rules: number each rule. 1: S → aABe, 2: A → Abc, 3: A → b, 4: B → d

119 Example
1: S → aABe, 2: A → Abc, 3: A → b, 4: B → d
$ | abbcde$ | Shift a → $a | bbcde$ | Shift b → $ab | bcde$ | Reduce 3 → $aA | bcde$ | …
The handle is b (the RHS of rule 3); we reduce with A, the LHS of rule 3. The remaining steps are filled in on the following slides.

120 Example
… $aA | bcde$ | Shift b → $aAb | cde$ | Shift c → $aAbc | de$ | Reduce 2 → $aA | de$ | …
The handle is Abc; rule 2 is the rule to reduce with.

121 Example
… $aA | de$ | Shift d → $aAd | e$ | Reduce 4 → $aAB | e$ | …
The handle is d; rule 4 is the rule to reduce with.

122 Example
… $aAB | e$ | Shift e → $aABe | $ | Reduce → $S | $ | Accept

123 Example
1: S → aABe, 2: A → Abc, 3: A → b, 4: B → d
Stack  | Input    | Action
$      | abbcde$  | Shift a
$a     | bbcde$   | Shift b
$ab    | bcde$    | Reduce 3
$aA    | bcde$    | Shift b
$aAb   | cde$     | Shift c
$aAbc  | de$      | Reduce 2
$aA    | de$      | Shift d
$aAd   | e$       | Reduce 4
$aAB   | e$       | Shift e
$aABe  | $        | Reduce 1
$S     | $        | Accept
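The mechanics of this trace can be replayed in code. Note that the shift/reduce choices below are supplied by hand, taken straight from the trace; choosing them automatically is exactly the job of the LR tables introduced on the next slides. A sketch (names are my own):

```python
# Grammar: 1: S -> aABe, 2: A -> Abc, 3: A -> b, 4: B -> d
rules = {1: ("S", "aABe"), 2: ("A", "Abc"), 3: ("A", "b"), 4: ("B", "d")}

def run(input_, actions):
    """Replay a shift/reduce action sequence, printing the stack and input."""
    stack, buf = "$", input_ + "$"
    for act in actions:
        if act == "shift":
            stack, buf = stack + buf[0], buf[1:]
        else:                                   # act is a rule number to reduce by
            lhs, rhs = rules[act]
            assert stack.endswith(rhs), f"handle {rhs} not on top of {stack}"
            stack = stack[: -len(rhs)] + lhs    # pop the handle, push the LHS
        print(f"{stack:10} {buf}")
    return stack == "$S" and buf == "$"

ok = run("abbcde",
         ["shift", "shift", 3, "shift", "shift", 2, "shift", 4, "shift", 1])
print(ok)    # True: the input reduces to the start symbol
```

The assert inside the reduce branch checks the defining property of a handle: the RHS being reduced must sit on top of the stack.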

124 Key Observation
At any point in time, the content of the stack is a prefix of a right-sentential form. This prefix is called a viable prefix.
Check again: below are all the right-sentential forms of the rightmost derivation S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde
Stack contents seen during the parse: $, $a, $ab, $aA, $aAb, $aAbc, $aAd, $aAB, $aABe, $S

125 What is General Processing for BUP?
Utilize a stack implementation: it contains symbols, non-terminals, and input; the input is examined w.r.t. the stack/current state.
General operation, options to process the stack:
Shift symbols from the input onto the stack.
When handle β is on top of the stack, reduce by using rule A → β: pop all symbols of handle β (RHS of rule) and push non-terminal A (LHS of rule) onto the stack.
When the configuration is ($S, $), ACCEPT.
An error occurs when the handle can’t be found, or S is on the stack with non-empty input.

126 Consider the Example Below

127 What are Possible Grammar Conflicts?
Shift-Reduce (S/R) Conflict: given the content of the stack and the current input, there is more than one option for what to do next.
stmt → if expr then stmt | if expr then stmt else stmt | other
Consider the stack below with input token else:
$ … if expr then stmt
Do we reduce “if expr then stmt” to stmt? This assumes there is an earlier if to match the upcoming else. Or do we shift “else” onto the stack? This matches the else to the most recent if. Either choice can fail based on input yet to be seen.

128 What are Possible Grammar Conflicts?
Reduce-Reduce (R/R) Conflict:
stmt → id ( parameter_list )
parameter_list → parameter_list , parameter
parameter → id
expr → id ( expression_list ) | id
expression_list → expression_list , expr | expr
Consider the stack below with input token $:
$ … id ( id , … , id )
Do we reduce to stmt? Do we reduce to expr?

129 Bottom-Up Parsing Techniques
LR(k) Parsers Left to Right Input Scanning (L) Construct a Rightmost Derivation in Reverse (R) Use k Lookahead Symbols for Decisions Advantages Well Suited to Almost All PLs Most General Approach/Efficiently Implemented Detects Syntax Errors Very Quickly Disadvantages Difficult to Build by Hand Tools to Assist Parser Construction (Yacc, Bison)

130 Components of an LR Parser
[block diagram] Grammar → Table Generator → Parsing Table (this part differs based on the grammar/lookaheads)
Input Tokens → Driver Routines + Parsing Table → Output Parse Tree (the driver routines are common to all LR parsers)

131 Three Classes of LR Parsers
Simple LR (SLR), based on LR(0) items: easiest, but limited in grammar applicability; some grammars lead to S/R and R/R conflicts
Canonical LR(k), with k lookaheads: powerful but expensive; usually LR(1)
Lookahead LR (LALR): in between the two
Two-fold focus: parser table construction (items and item sets), and examination of the LR parsing algorithm

132 LR Parser Structure
[figure: input a1 … ai … an$, a stack of alternating states and grammar symbols s0 X1 s1 … Xm sm, the LR parsing program, and the action/goto tables]
A configuration is (s0 X1 s1 X2 … Xm-1 sm-1 Xm sm , ai ai+1 … an $), where each Xi is a grammar symbol (terminal or non-terminal) and each si a state.
action[sm , ai] is the parsing table entry with four options: 1. shift s onto the stack; 2. reduce by a rule; 3. accept ($,$); 4. report an error.
goto[sm , A] determines the next state after a reduction to non-terminal A.
Question: what does X1 X2 … Xm ai ai+1 … an represent? A right-sentential form.

133 What is the Parsing Table?
A combination of state, action, and goto entries. Shift s5 means: shift the input symbol and push state 5. Reduce r2 means: reduce using rule 2. A goto entry, indexed by state and non-terminal, indicates the next state after a reduction.

134 Actions Against Configuration
Configuration: (s0 X1 s1 X2 … Xm-1 sm-1 Xm sm , ai ai+1 … an $)
action[sm , ai] = Shift s: move ai and s onto the stack, giving (s0 X1 s1 … Xm sm ai s , ai+1 … an $)
Reduce A → β: remove 2×|β| symbols from the stack (β’s symbols and their states), then push A along with state s = goto[s’, A], where s’ is the state uncovered after popping
Accept: parsing complete
Error: call the recovery routine
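These actions translate directly into a driver loop. Below is a sketch with hand-built SLR tables for a tiny hypothetical grammar (1: S → aSb, 2: S → c); for a real language the tables would come from a generator, and all state numbers and entries here are my own construction:

```python
def lr_parse(tokens, action, goto, rules):
    """Generic LR driver: action maps (state, terminal) to ('shift', s),
    ('reduce', r) or ('accept',); goto maps (state, non-terminal) to a state."""
    stack = [0]                   # stack of states; grammar symbols are implicit
    toks = tokens + ["$"]
    i = 0
    while True:
        act = action.get((stack[-1], toks[i]))
        if act is None:
            raise SyntaxError(f"no action in state {stack[-1]} on {toks[i]}")
        if act[0] == "shift":
            stack.append(act[1]); i += 1      # push the new state, advance input
        elif act[0] == "reduce":
            lhs, rhs_len = rules[act[1]]
            # Pop |beta| states (symbols are implicit here, so the slide's
            # 2x|beta| stack entries collapse to |beta| states).
            del stack[len(stack) - rhs_len:]
            stack.append(goto[(stack[-1], lhs)])   # goto from the uncovered state
        else:
            return True                            # accept

# Hand-built SLR tables for: 1: S -> a S b, 2: S -> c
rules = {1: ("S", 3), 2: ("S", 1)}
action = {(0, "a"): ("shift", 2), (0, "c"): ("shift", 3),
          (1, "$"): ("accept",),
          (2, "a"): ("shift", 2), (2, "c"): ("shift", 3),
          (3, "b"): ("reduce", 2), (3, "$"): ("reduce", 2),
          (4, "b"): ("shift", 5),
          (5, "b"): ("reduce", 1), (5, "$"): ("reduce", 1)}
goto = {(0, "S"): 1, (2, "S"): 4}

print(lr_parse(["a", "a", "c", "b", "b"], action, goto, rules))   # True
```

Keeping only states on the stack is a common implementation trick: the grammar symbol associated with each state is recoverable from the tables, so it never needs to be stored.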

135 How Does BUP Work? Stack Input Action

136 Another Detailed Example

137 Another Detailed Example

138 Another Detailed Example

139 Parser Generators The entire process we describe can be automated
Computation of the machine states Computation of the lookaheads Computation of the action and goto tables Optimization of the LALR tables. Therefore... Tools exist to do this for you!

140 Parser Generators II
In the C/C++ world:
Most famous parser generator: YACC (LALR(1))
Most used parser generator: BISON (LALR(1))
Table-driven leftmost: PCCTS (LL(k))
In the Java world, several alternatives:
CUP (a BISON/YACC lookalike, LALR(1))
JACK (LALR(1))

141 Big Picture

142 Summary Syntax analysis is a common part of language implementation
A lexical analyzer is a pattern matcher that isolates the small-scale parts of a program
A syntax analyzer detects syntax errors and produces a parse tree
A recursive-descent parser is an LL parser that can be built directly from EBNF
The parsing problem for bottom-up parsers: find the substring of the current sentential form that must be reduced
The LR family of shift-reduce parsers is the most common bottom-up parsing approach Copyright © 2015 Pearson. All rights reserved.

