Chapter 2 Syntax A language that is simple to parse for the compiler is also simple to parse for the human programmer. N. Wirth.

2.1 Grammars
  2.1.1 Backus-Naur Form
  2.1.2 Derivations
  2.1.3 Parse Trees
  2.1.4 Associativity and Precedence
  2.1.5 Ambiguous Grammars
2.2 Extended BNF
2.3 Syntax of a Small Language: Clite
  2.3.1 Lexical Syntax
  2.3.2 Concrete Syntax
2.4 Compilers and Interpreters
2.5 Linking Syntax and Semantics
  2.5.1 Abstract Syntax
  2.5.2 Abstract Syntax Trees
  2.5.3 Abstract Syntax of Clite

 Motivation for using a subset of C: grammar size

Language   Grammar (pages)   Reference
Pascal     5                 Jensen & Wirth
C          6                 Kernighan & Ritchie
C++        22                Stroustrup
Java       14                Gosling et al.

 The Clite grammar fits on one page (next 3 slides), so it’s a far better tool for studying language design.

Program  int main ( ) { Declarations Statements } Declarations  { Declaration } Declaration  Type Identifier [ [ Integer ] ] {, Identifier [ [ Integer ] ] } ; Type  int | bool | float | char Statements  { Statement } Statement  ; | Block | Assignment | IfStatement | WhileStatement Block  { Statements } Assignment  Identifier [ [ Expression ] ] = Expression ; IfStatement  if ( Expression ) Statement [ else Statement ] WhileStatement  while ( Expression ) Statement

Expression  Conjunction { || Conjunction } Conjunction  Equality { && Equality } Equality  Relation [ EquOp Relation ] EquOp  == | != Relation  Addition [ RelOp Addition ] RelOp  | >= Addition  Term { AddOp Term } AddOp  + | - Term  Factor { MulOp Factor } MulOp  * | / | % Factor  [ UnaryOp ] Primary UnaryOp  - | ! Primary  Identifier [ [ Expression ] ] | Literal | ( Expression ) | Type ( Expression )

Identifier  Letter { Letter | Digit } Letter  a | b |... | z | A | B |... | Z Digit  0 | 1 |... | 9 Literal  Integer | Boolean | Float | Char Integer  Digit { Digit } Boolean  true | false Float  Integer. Integer Char  ‘ ASCII Char ‘ (ASCII Char is the set of ASCII characters)

13 grammar rules for expressions; compare to 4 pages for C++
Metabraces { } (zero or more) are interpreted to mean left associativity; e.g.,
◦ Addition → Term { AddOp Term }
◦ AddOp → + | -
Metabrackets [ ] (optional) mean that an Addition can be followed by at most one relational operator and one more Addition:
◦ Relation → Addition [ RelOp Addition ]
◦ RelOp → < | <= | > | >=
◦ (so a > b > c, for example, is not allowed)
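The link between metabraces and left associativity can be sketched with a tiny recursive-descent routine. This is only an illustration, not the textbook's parser: the class name `AdditionParser` is hypothetical, Term is simplified to a single token, and tokens are assumed to be separated by spaces.

```java
// Sketch of Addition -> Term { AddOp Term }.
// Assumptions: Term is reduced to one token, tokens are space-separated,
// and the input is well formed (no error handling).
class AdditionParser {
    private final String[] tokens;
    private int pos = 0;

    AdditionParser(String source) {
        this.tokens = source.trim().split("\\s+");
    }

    // The metabraces { AddOp Term } become a loop. Each pass makes the tree
    // built so far the LEFT operand of the next operator, which is exactly
    // what left associativity means.
    String parseAddition() {
        String tree = parseTerm();
        while (pos < tokens.length
                && (tokens[pos].equals("+") || tokens[pos].equals("-"))) {
            String op = tokens[pos++];
            String right = parseTerm();
            tree = "(" + tree + " " + op + " " + right + ")";
        }
        return tree;
    }

    // Simplified Term: consume one token and return it unchanged.
    private String parseTerm() {
        return tokens[pos++];
    }

    public static void main(String[] args) {
        // prints ((10 - 4) - 3)
        System.out.println(new AdditionParser("10 - 4 - 3").parseAddition());
    }
}
```

A rule written with metabrackets instead would replace the `while` with a single `if`, so a second operator at the same level would simply not be consumed.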

◦ Comments
◦ The significance of whitespace
◦ Distinguishing one token <= from two tokens < =
◦ Distinguishing identifiers from keywords like if

 The Clite grammar has two levels
◦ lexical level (described by lexical syntax)
◦ syntactic level (described by concrete syntax)
 They correspond to two separate parts of a compiler.
 The issues on the previous slide are lexical issues.

 Examples of lexical entities (tokens):
◦ Identifiers, e.g., numbr1, X
◦ Literals, e.g., 123, 'x', 3.25, true
◦ Keywords: bool char else false float if int main true while
◦ Operators: = || && == != < <= > >= + - * / ! %
◦ Punctuation: ; , { } ( )
◦ Char, e.g., '?'

 Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment
 No token may contain embedded whitespace (unless it is a character or string literal)
 Example:
◦ >= is one token
◦ > = is two tokens

 while ( a <= b)    legal; spacing between tokens is allowed
 while(a<=b)    also legal; spacing is not needed
 while (a < = b)    no lexical errors, but illegal syntactically; the lexer would identify the tokens while, (, a, <, =, b, )
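The three cases above can be reproduced with a small tokenizer. This is a minimal sketch under stated assumptions: the class name `CliteLexerSketch` is hypothetical, letters and digits are lumped into one kind of lexeme, and comments, Char literals and Float literals are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a Clite-like tokenizer. Assumptions: no comments, no Char or
// Float literals, and identifiers are not distinguished from integers.
class CliteLexerSketch {
    // Two-character operators are tried before one-character ones, so that
    // "<=" is recognized as one token when no whitespace intervenes.
    private static final String[] TWO_CHAR = {"<=", ">=", "==", "!=", "&&", "||"};
    private static final String ONE_CHAR = "<>=+-*/%!;,(){}[]";

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {   // whitespace only separates tokens
                i++;
                continue;
            }
            if (Character.isLetterOrDigit(c)) { // run of letters and digits
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) {
                    i++;
                }
                tokens.add(src.substring(start, i));
                continue;
            }
            String two = i + 1 < src.length() ? src.substring(i, i + 2) : "";
            boolean matched = false;
            for (String op : TWO_CHAR) {
                if (op.equals(two)) {
                    matched = true;
                }
            }
            if (matched) {
                tokens.add(two);
                i += 2;
            } else if (ONE_CHAR.indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("while ( a <= b)"));  // [while, (, a, <=, b, )]
        System.out.println(tokenize("while(a<=b)"));      // [while, (, a, <=, b, )]
        System.out.println(tokenize("while (a < = b)"));  // [while, (, a, <, =, b, )]
    }
}
```

The third call succeeds at the lexical level; it is the parser that would reject the sequence `<` `=`.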

 Clite uses the // comment style of C++
 Comments are not defined in the Clite grammar (but could be); instead, they are defined outside the grammar
 The use of whitespace to differentiate between one- and two-character operators is also defined outside the grammar.

Identifier: a sequence of letters and digits, starting with a letter
 “if” has the form of an identifier but is also a keyword
Keywords versus reserved words:
◦ Keyword: predefined by the language
◦ Reserved word: can only be used as defined
◦ In most languages all keywords are also reserved, but in a few (e.g., Pascal) some identifiers are predefined but not reserved, and can be redefined by the programmer.
Implications? Flexibility, confusion …
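One common way to separate keywords from ordinary identifiers is to recognize the identifier shape first and then look the lexeme up in a keyword table. The sketch below assumes this approach; the class and method names are hypothetical, and the keyword set is the one listed among the Clite tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: classify a lexeme as Keyword, Identifier, or Other.
// Assumption: keywords are found by table lookup after the identifier
// shape (Letter { Letter | Digit }) has been recognized.
class TokenClassifier {
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "bool", "char", "else", "false", "float",
            "if", "int", "main", "true", "while"));

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Identifier -> Letter { Letter | Digit }
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty() || !isLetter(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isLetter(c) && !isDigit(c)) {
                return false;
            }
        }
        return true;
    }

    static String classify(String lexeme) {
        if (!looksLikeIdentifier(lexeme)) {
            return "Other";
        }
        return KEYWORDS.contains(lexeme) ? "Keyword" : "Identifier";
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));      // Keyword
        System.out.println(classify("numbr1"));  // Identifier
        System.out.println(classify("1x"));      // Other
    }
}
```

In a language where some predefined names are not reserved, those names would be left out of the table and resolved later through the symbol table instead.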

 The concrete syntax of a language is the set of rules for writing correct programs
 The structure of a specific program can be represented by a parse tree, based on the concrete syntax of the language, using the stream of tokens identified during lexical analysis
 The root of the parse tree is the start symbol of the language (Program, in Clite).

Clite’s expression rules are unambiguous with respect to precedence and associativity
◦ Rule ordering defines precedence; rule format defines associativity.
The C/C++ expression grammar definition is ambiguous; precedence and associativity are specified separately.

Clite Operator    Associativity
Unary - !         none
* /               left
+ - left
< <= > >=         none (i.e., no a < b <= c)
== !=             none
&&                left
||                left

 … are non-associative (an idea borrowed from Ada).
 Why is this important? In C and C++, the expression
   if (10 < x < 20)
 is not equivalent to
   if (10 < x && x < 20)
 But it is error-free! So, what does it mean?
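In C, < is left associative and yields the int 1 or 0, so 10 < x < 20 groups as (10 < x) < 20. Java rejects that expression at compile time, so the sketch below models the C behavior with a hypothetical helper `lessThan` that returns 1 or 0; all names here are illustrative.

```java
// Sketch: what C's "10 < x < 20" computes, modeled in Java.
// Assumption: lessThan mimics C's <, which yields the int 1 or 0.
class ChainedComparison {
    static int lessThan(int a, int b) {
        return a < b ? 1 : 0;
    }

    // C groups 10 < x < 20 as (10 < x) < 20: the inner result (0 or 1)
    // is then compared with 20, which is always true.
    static int chained(int x) {
        return lessThan(lessThan(10, x), 20);
    }

    // What the programmer almost certainly intended.
    static int intended(int x) {
        return (10 < x && x < 20) ? 1 : 0;
    }

    public static void main(String[] args) {
        for (int x : new int[] {5, 15, 25}) {
            System.out.println("x=" + x
                    + " chained=" + chained(x)
                    + " intended=" + intended(x));
        }
        // x=5 chained=1 intended=0
        // x=15 chained=1 intended=1
        // x=25 chained=1 intended=0
    }
}
```

Making the relational operators non-associative turns this silent logic error into a syntax error.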

 Grammar rules don’t specify the operand types to be used with various operators; e.g., is true + 13 a legal expression? What is the type of the expression 123.78 + 37?
 These are type and semantic issues, not lexical or syntactic ones.

[Figure: phases of a compiler]
Source Program → Lexical Analyzer (lexer) → Tokens → Syntactic Analyzer (parser) → Abstract Syntax → Semantic Analyzer → Intermediate Code (IC) → Code Optimizer → Intermediate Code (IC) → Code Generator → Machine Code

 Input: characters (the program)
 Output: tokens and token types
 Lexical grammars are simpler than syntax grammars
 Lexers are often generated automatically by lexical-analyzer generators

Often based on a BNF/EBNF grammar
Input: tokens
Output: abstract syntax tree or some other representation of the program
Abstract syntax: similar to a concrete parse tree, but with punctuation and many nonterminals discarded

Typical tasks:
◦ Check that all identifiers are declared
◦ Perform type checking for expressions, assignments, …
◦ Insert implied conversion operators (i.e., make them explicit)
Context-free grammars can’t express the semantic rules that are needed for this phase of translation.
Output: intermediate code (IC) tree, a modified abstract syntax tree representation.

 Purpose: improve the run-time performance of the object code
◦ Usually, to make it run faster
◦ Other possibilities: reduce the amount of storage required
 Drawback: optimization is time-consuming; slows down debugging
 Output: intermediate code, similar to abstract syntax notation; closer to machine code

◦ Evaluate constant expressions at compile time
◦ In-line expansion (of function calls)
◦ Loop unrolling
◦ Reorder code to improve cache performance
◦ Eliminate common sub-expressions
◦ Eliminate unnecessary code
◦ Store local variables/intermediate results in registers rather than on the stack or elsewhere in memory
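The first item, evaluating constant expressions at compile time, can be sketched as a bottom-up pass over an expression tree. The classes `Num`, `Var` and `Bin` below are hypothetical stand-ins for a translator's abstract syntax, and only +, - and * are folded.

```java
// Sketch of constant folding on a tiny expression tree.
// Assumptions: integer constants only; +, - and * are folded;
// every other operator is left untouched.
class FoldingSketch {
    static abstract class Expr { }

    static class Num extends Expr {
        final int value;
        Num(int value) { this.value = value; }
    }

    static class Var extends Expr {
        final String id;
        Var(String id) { this.id = id; }
    }

    static class Bin extends Expr {
        final char op;
        final Expr left, right;
        Bin(char op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
    }

    // Fold the operands first, then the node itself if both are constants.
    static Expr fold(Expr e) {
        if (!(e instanceof Bin)) {
            return e;
        }
        Bin b = (Bin) e;
        Expr l = fold(b.left);
        Expr r = fold(b.right);
        if (l instanceof Num && r instanceof Num) {
            int x = ((Num) l).value;
            int y = ((Num) r).value;
            switch (b.op) {
                case '+': return new Num(x + y);
                case '-': return new Num(x - y);
                case '*': return new Num(x * y);
                default: break; // not folded in this sketch
            }
        }
        return new Bin(b.op, l, r);
    }

    static String show(Expr e) {
        if (e instanceof Num) return String.valueOf(((Num) e).value);
        if (e instanceof Var) return ((Var) e).id;
        Bin b = (Bin) e;
        return "(" + show(b.left) + " " + b.op + " " + show(b.right) + ")";
    }

    public static void main(String[] args) {
        Expr e = new Bin('*', new Var("x"), new Bin('+', new Num(2), new Num(3)));
        System.out.println(show(e));        // (x * (2 + 3))
        System.out.println(show(fold(e)));  // (x * 5)
    }
}
```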

Output: machine code
Instruction selection, register management
“Peephole” optimization: look at a small segment of machine code and make it more efficient; e.g.,
◦ x = y;       →   load y, R0
                   store R0, x
  z = x * 2;   →   load x, R0    // redundant code
                   mul R0, #2
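The redundant load in the example can be removed by a pass that inspects two adjacent instructions at a time. This is a rough sketch: the class name is hypothetical, instructions are plain strings in the slide's notation, and it assumes no jump targets the removed instruction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of one peephole rule: "store R, v" followed directly by
// "load v, R" makes the load redundant, since R already holds v.
// Assumption: no label or jump targets the load being removed.
class PeepholeSketch {
    static List<String> removeRedundantLoads(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String instr : code) {
            if (!out.isEmpty() && isRedundantLoad(out.get(out.size() - 1), instr)) {
                continue; // drop the redundant load
            }
            out.add(instr);
        }
        return out;
    }

    static boolean isRedundantLoad(String prev, String cur) {
        String[] p = prev.split("[ ,]+"); // e.g. [store, R0, x]
        String[] c = cur.split("[ ,]+");  // e.g. [load, x, R0]
        return p.length == 3 && c.length == 3
                && p[0].equals("store") && c[0].equals("load")
                && p[1].equals(c[2])   // same register
                && p[2].equals(c[1]);  // same variable
    }

    public static void main(String[] args) {
        List<String> code = Arrays.asList(
                "load y, R0", "store R0, x", "load x, R0", "mul R0, #2");
        // prints [load y, R0, store R0, x, mul R0, #2]
        System.out.println(removeRedundantLoads(code));
    }
}
```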

 An interpreter replaces the last 2 phases of a compiler with direct execution
 Input:
◦ Mixed: generates and uses intermediate code/abstract syntax
◦ Pure: starts from a stream of ASCII characters each time a statement is executed
 Mixed interpreters:
◦ Java, Perl, Python, Haskell, Scheme
 Pure interpreters:
◦ most BASICs, shell commands

 Source code: a = x + y;
 Compiler-generated object code:
   load r0, x
   add r0, y
   store r0, a
 This code will be executed later, with the remainder of the program
 Interpreter: calls an interpretive routine to actually perform the operation, e.g., add(x, y, a);
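An interpretive routine such as add(x, y, a) might look like the following. This is a sketch only: the class `InterpreterSketch` and its `memory` map are assumptions standing in for whatever run-time state a real interpreter keeps.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an interpretive routine for a = x + y.
// Assumption: variables live in a name-to-value map and hold ints.
class InterpreterSketch {
    final Map<String, Integer> memory = new HashMap<>();

    // Performs the addition immediately instead of emitting
    // load/add/store instructions for later execution.
    // Throws NullPointerException if x or y has no value yet.
    void add(String x, String y, String a) {
        memory.put(a, memory.get(x) + memory.get(y));
    }

    public static void main(String[] args) {
        InterpreterSketch interp = new InterpreterSketch();
        interp.memory.put("x", 3);
        interp.memory.put("y", 4);
        interp.add("x", "y", "a");
        System.out.println(interp.memory.get("a")); // prints 7
    }
}
```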

 It’s not the case that the lexical analyzer identifies all the tokens, then the parser analyzes all the tokens, and then the type/semantic analysis is performed.
 Instead, the parser repeatedly calls the lexer to get another token
 As tokens are received, they either do or don’t match the expected syntax
◦ If there’s a match, perform any type or semantic testing, possibly generate intermediate code, and call for another token.

 Output of the parser: the concrete parse tree is large, probably more than is needed for the next phase
◦ The compiler usually produces some more compact representation of the program
 Example: Fig. 2.9 (page 46)

 The shape of the parse tree reveals the meaning of the program.
 So as output of syntax analysis we want a tree that removes its inefficiency and keeps its meaning.
◦ Remove separator/punctuation terminal symbols
◦ Remove all trivial nonterminals
◦ Replace remaining nonterminals with leaf terminals
 Example: Fig. 2.10

Removes unnecessary details but keeps the essential language elements; e.g., consider the following two equivalent loops:

Pascal                    C++
while j < n do begin      while (j < n) {
    j := j + 1;               j = j + 1;
end;                      }

Essential information: 1) it is a loop, 2) its terminating condition is j >= n, and 3) its body increments the current value of j.

Purpose: an intermediate form of the source code
Generated by the parser during syntax analysis
Used during type checking/semantic analysis
Abstract syntax rules are defined by the compiler (or interpreter).
One concrete syntax can have several abstract syntaxes associated with it, depending on the design of the translator.

 LHS = RHS
 LHS names an abstract syntax class
 RHS either
   (1) gives a list of one or more alternatives, or
   (2) lists the essential elements of the syntax class
 Compare to production rules in the concrete grammar

Concrete: Assignment → Identifier [ [ Expression ] ] = Expression ;
Abstract: Assignment = Variable target ; Expression source

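[Figure: abstract syntax tree of an assignment, with a target branch and a source branch. Node labels: Variable z, Variable x, Variable y, Value 2, and Binary nodes with Operator + and Operator *.]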

Binary node: op, term1, term2
Unary node: op, term
Binaries and unaries represent information that can be used for later processing

abstract class Expression { }
abstract class VariableRef extends Expression { }
class Variable extends VariableRef { String id; }
class ArrayRef extends VariableRef { String id; Expression index; }
class Value extends Expression { … }
class Binary extends Expression { Operator op; Expression term1, term2; }
class Unary extends Expression { UnaryOp op; Expression term; }
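To see how these classes fit together, the sketch below builds the abstract syntax tree for a hypothetical assignment z = x + 2 * y and prints it. Several details are assumptions made only for illustration: `Operator` is reduced to a wrapper around a string, `Value` holds just an int, constructors and the `show` helpers are added, and the classes are nested inside a wrapper class so the file is self-contained.

```java
// Sketch: building and printing an abstract syntax tree.
// Assumptions: Operator wraps a String, Value holds only an int, and the
// constructors and show() helpers are illustrative additions.
class AstSketch {
    static abstract class Expression { }
    static abstract class VariableRef extends Expression { }

    static class Variable extends VariableRef {
        final String id;
        Variable(String id) { this.id = id; }
    }

    static class Value extends Expression {
        final int intValue;
        Value(int intValue) { this.intValue = intValue; }
    }

    static class Operator {
        final String val;
        Operator(String val) { this.val = val; }
    }

    static class Binary extends Expression {
        final Operator op;
        final Expression term1, term2;
        Binary(Operator op, Expression term1, Expression term2) {
            this.op = op;
            this.term1 = term1;
            this.term2 = term2;
        }
    }

    // Assignment = Variable target ; Expression source
    static class Assignment {
        final Variable target;
        final Expression source;
        Assignment(Variable target, Expression source) {
            this.target = target;
            this.source = source;
        }
    }

    static String show(Expression e) {
        if (e instanceof Variable) return ((Variable) e).id;
        if (e instanceof Value) return String.valueOf(((Value) e).intValue);
        Binary b = (Binary) e;
        return "(" + show(b.term1) + " " + b.op.val + " " + show(b.term2) + ")";
    }

    static String show(Assignment a) {
        return show(a.target) + " = " + show(a.source);
    }

    public static void main(String[] args) {
        Expression source = new Binary(new Operator("+"),
                new Variable("x"),
                new Binary(new Operator("*"), new Value(2), new Variable("y")));
        Assignment a = new Assignment(new Variable("z"), source);
        System.out.println(show(a)); // prints z = (x + (2 * y))
    }
}
```

The tree carries no semicolon, no = token and no chain of Term and Factor nonterminals; precedence is recorded purely by which Binary node is nested inside which.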

 Lexical syntax: small, simple, defines language tokens
 Concrete syntax: detailed, specific, defines correct programs, used to direct parsing algorithms (language specific, not implementation specific)
 Abstract syntax: simpler than concrete, used to describe the structure of the intermediate code (implementation specific, not language specific); NOT INTENDED TO BE USED FOR PARSING
 Semantics: program “meaning”, or runtime behavior

 A syntax is ambiguous if a portion of a program has two or more possible interpretations (parse trees)
 Non-ambiguous grammars can be written, but some ambiguity may be tolerated to reduce grammar size.
 Operator associativity and precedence can be defined by the concrete syntax
 Compilers generate code for later execution, while interpreters execute program statements as they are analyzed.
