Course Outline Translators and Compilers Major Programming Project

CSC 415: Translators and Compilers
Dr. Chuck Lillie

Course Outline
  Translators and Compilers
    Language Processors
    Compilation
    Syntactic Analysis
    Contextual Analysis
    Run-Time Organization
    Code Generation
    Interpretation
  Major Programming Project
    Project Definition and Planning
    Implementation
    Weekly Status Reports
    Project Presentation

Project
  Implement a Compiler for the Programming Language Triangle
    Appendix B: Informal Specification of the Programming Language Triangle
    Appendix D: Class Diagrams for the Triangle Compiler
  Present Project Plan
    What and How
  Weekly Status Reports
    Work accomplished during the reporting period
    Deliverable progress, as a percentage of completion
    Problem areas
    Planned activities for the next reporting period

Chapter 1: Introduction to Programming Languages
Programming Language: A formal notation for expressing algorithms. Programming Language Processors: Tools to enter, edit, translate, and interpret programs on machines. Machine Code: Basic machine instructions Keep track of exact address of each data item and each instruction Encode each instruction as a bit string Assembly Language: Symbolic names for operations, registers, and addresses.

Programming Languages
High Level Languages: Notation similar to familiar mathematical notation
  Expressions: +, -, *, /
  Data Types: truth values, characters, integers, records, arrays
  Control Structures: if, case, while, for
  Declarations: constant values, variables, procedures, functions, types
  Abstraction: separates what is to be performed from how it is to be performed
  Encapsulation (or data abstraction): group together related declarations and selectively hide some

Programming Languages
Language Processor: any system that manipulates programs expressed in some particular programming language
  Editors: enter, modify, and save program text
  Translators and Compilers: translate text from one language to another
    A compiler translates a program from a high-level language to a low-level language, preparing it to be run on a machine
    A compiler also checks the program for syntactic and contextual errors
  Interpreters: run a program without compilation
    Command languages
    Database query languages

Programming Languages Specifications
Syntax Form of the program Defines symbols How phrases are composed Contextual constraints Scope: determine scope of each declaration Type: Semantics Meaning of the program

Representation Syntax Backus-Naur Form (BNF): context-free grammar
Terminal symbols (>=, while, ;) Non-terminal symbols (Program, Command, Expression, Declaration) Start symbol (Program) Production rules (defines how phrases are composed from terminals and sub-phrases) N::=a|b|…. Syntax Tree Used to define language in terms of strings and terminal symbols

Representation Semantics Abstract Syntax Abstract Syntax Tree
Concentrate on phrase structure alone Abstract Syntax Tree

Contextual Constraints
Scope Binding Static: determined by language processor Dynamic: determined at run-time Type Statically: language processor can detect all errors Dynamically: type errors cannot be detected until run-time Will assume static binding and statically typed

Semantics Concerned with meaning of program
Behavior when run Usually specified informally Declarative sentences Could include side effects Correspond to production rules

Chapter 2: Language Processors
Translators and Compilers Interpreters Real and Abstract Machines Interpretive Compilers Portable Compilers Bootstrapping Case Study: The Triangle Language Processor

Translators & Compilers
Translator: a program that accepts any text expressed in one language (the translator’s source language), and generates a semantically-equivalent text expressed in another language (its target language) Chinese-into-English Java-into-C Java-into-x86 X86 assembler

Assembler: translates from an assembly language into the corresponding machine code Generates one machine code instruction per source instruction Compiler: translates from a high-level language into a low-level language Generates several machine-code instructions per source command.

Disassembler: translates a machine code into the corresponding assembly language Decompiler: translates a low-level language into a high-level language Question: Why would you want a disassembler or decompiler?

Source Program: the source language text
Object Program: the target language text
[Figure: compiler block diagram. Source Program → Syntax Check → Context Constraints → Generate Object Code → Object Program, with Semantic Analysis labeled in the diagram]
The object program is semantically equivalent to the source program, if the source program is well-formed

Why would you want to do: Java-into-C translator C-into-Java translator Assembly-language-into-Pascal decompiler

[Figure: tombstone diagrams for a program, a machine, and a translator]
Program: P = program name, L = implementation language
Machine: M = target machine
Running P on M: for this to work, L must equal M, that is, the implementation language must be the same as the machine language
Translator: S = source language, T = target language, L = translator’s implementation language
An S-into-T translator is itself a program, expressed in L, that runs on machine L

Translating a source program P, expressed in language S, using an S-into-T translator running on machine M. The resulting object program P is expressed in language T.

Translating a source program sort
  Expressed in language Java,
  Using a Java-into-x86 translator
  Running on an x86 machine
  The object program is running on the same machine as the compiler
[Figure: tombstone diagram. sort (Java) is translated by a Java-into-x86 translator on x86 into sort (x86), which then runs on x86]

Translating a source program sort
  Expressed in language Java,
  Using a Java-into-PPC translator
  Running on an x86 machine
  Downloaded to a PPC machine
  Cross Compiler: the object program is running on a different machine than the compiler
[Figure: tombstone diagram. sort (Java) is translated on x86 into sort (PPC), which is downloaded to and run on PPC]

Translating a source program sort
  Expressed in language Java,
  Using a Java-into-C translator
  Running on an x86 machine
  Then translating the C program
  Using a C-into-x86 compiler
  Running on an x86 machine
  Into an x86 object program
  Two-stage Compiler: the source program is translated to another language before being translated into the object program
[Figure: tombstone diagram. sort (Java) → sort (C) → sort (x86), both translators running on x86]

Translator Rules Can run on machine M only if it is expressed in machine code M Source program must be expressed in translator’s source language S Object program is expressed in the translator’s target language T Object program is semantically equivalent to the source program

Interpreters Accepts any program (source program) expressed in a particular language (source language) and runs that source program immediately Does not translate the source program into object code prior to execution

Interpreters
[Figure: interpreter flow. Source Program → Fetch Instruction → Analyze Instruction → Execute Instruction, repeated until Program Complete]
Source program starts to run as soon as the first instruction is analyzed
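The fetch, analyze, execute cycle can be sketched in a few lines of Java. This is only an illustration: the four-instruction stack language (PUSH, ADD, MUL, HALT) and the class name TinyInterpreter are made up for this sketch and do not come from the course material.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy interpreter (hypothetical instruction set): one instruction is fetched, analyzed, and executed per loop turn. */
final class TinyInterpreter {
    /** Runs lines such as "PUSH 3", "ADD", "MUL", "HALT" and returns the value on top of the stack. */
    static int run(String[] program) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;                                             // index of the next instruction
        while (pc < program.length) {
            String instruction = program[pc++];                 // fetch
            String[] parts = instruction.trim().split("\\s+");  // analyze: opcode and optional operand
            switch (parts[0]) {                                 // execute
                case "PUSH": stack.push(Integer.parseInt(parts[1])); break;
                case "ADD":  stack.push(stack.pop() + stack.pop()); break;
                case "MUL":  stack.push(stack.pop() * stack.pop()); break;
                case "HALT": return stack.pop();                // program complete
                default: throw new IllegalArgumentException("unknown instruction: " + instruction);
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        String[] program = {"PUSH 2", "PUSH 3", "ADD", "PUSH 4", "MUL", "HALT"};
        System.out.println(run(program));                       // computes (2 + 3) * 4
    }
}
```

Note that every instruction is re-analyzed each time it is executed, which is one reason interpretation is slower than running compiled machine code.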

Interpreters When to Use Interpretation Disadvantages
Interactive mode – want to see results of instruction before entering next instruction Only use program once Each instruction expected to be executed only once Instructions have simple formats Disadvantages Slow: up to 100 times slower than in machine code

Interpreters Examples Basic Lisp Unix Command Language (shell) SQL

Interpreters S interpreter expressed in language L
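Program P expressed in language S, using an S interpreter, running on machine M
[Figure: tombstone diagram. P (expressed in S) sits on the S interpreter (expressed in M), which sits on machine M]
Example: program graph written in Basic, running on a Basic interpreter, executed on an x86 machine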

Real and Abstract Machines
Hardware emulation: Using software to execute one set of machine code on another machine Can measure everything about the new machine except its speed Abstract machine: emulator Real machine: actual hardware An abstract machine is functionally equivalent to a real machine if they both implement the same language L

Real and Abstract Machines
New Machine Instruction (nmi) interpreter written in C
  A compiler translates the C program into M machine code
  The nmi interpreter is translated into machine code M using the C compiler
  The result is an nmi interpreter expressed in machine code M
  A program P expressed in nmi code then runs on the nmi interpreter, which runs on M
[Figure: tombstone diagrams for these steps]

Interpretive Compilers
Combination of compiler and interpreter
  Translate source program into an intermediate language
    It is intermediate in level between the source language and ordinary machine code
    Its instructions have simple formats, and therefore can be analyzed easily and quickly
    Translation from the source language into the intermediate language is easy and fast
  An interpretive compiler combines fast compilation with tolerable running speed

Interpretive Compilers
Java-into-JVM translator running on machine M
JVM-code interpreter running on machine M
A Java program P is first translated into JVM-code, and then the JVM-code object program is interpreted
[Figure: tombstone diagrams. P (Java) passes through the Java-into-JVM translator on M to give P (JVM), which then runs on the JVM interpreter on M]

Portable Compilers
  A program is portable if it can be compiled and run on any machine, without change
  A portable program is more valuable than an unportable one, because its development cost can be spread over more copies
  Portability is measured by the proportion of code that remains unchanged when it is moved to a dissimilar machine
  Language affects portability
    Assembly language: 0% portable
    High level language: approaches 100% portability

Portable Compilers Language Processors
Valuable and widely used programs Typically written in high-level language Pascal, C, Java Part of language processor is machine dependent Code generation part Language processor is only about 50% portable Compiler that generates intermediate code is more portable than a compiler that generates machine code

Portable Compilers Rewrite interpreter in C
[Figure: tombstone diagrams showing the JVM interpreter rewritten from Java to C, compiled to machine code M, and then used on machine M together with the Java-into-JVM translator and program P]
Note: a C-into-M compiler exists; rewrite the JVM interpreter from Java to C

Bootstrapping The language processor is used to process itself
Implementation language is the source language
Bootstrapping a portable compiler
  A portable compiler can be bootstrapped to make a true compiler (one that generates machine code) by writing an intermediate-language-into-machine-code translator
Full bootstrap
  Writing the compiler in itself
  Using the latest version to upgrade the next version
Half bootstrap
  Compiler expressed in itself but targeted for another machine
Bootstrapping to improve efficiency
  Upgrade the compiler to optimize code generation as well as to improve compile efficiency

Bootstrapping Bootstrap an interpretive compiler to generate machine code
  First, write a JVM-code-into-M translator in Java
  Next, compile the translator using the existing interpreter
  Use the translator to translate itself
  Translate the Java-into-JVM-code translator into machine code
  Result: a two-stage Java-into-M compiler
[Figure: tombstone diagrams for each step]

Bootstrapping Full bootstrap
  v1: write an Ada-S compiler in C
  v2: convert the C version of the Ada-S compiler into an Ada-S version of the Ada-S compiler
  v3: extend the Ada-S compiler to a (full) Ada compiler
[Figure: tombstone diagrams showing versions v1, v2, and v3 being compiled on machine M]

Bootstrapping Half bootstrap
[Figure: tombstone diagrams for a half bootstrap of an Ada compiler, labeled with machines HM and TM and program P]

Bootstrapping Bootstrap to improve efficiency
[Figure: tombstone diagrams with compiler versions v1 and v2, machine-code variants Ms and Mf, and program P (Ada) running on machine M]

Chapter 3: Compilation
  Phases
    Syntactic Analysis
    Contextual Analysis
    Code Generation
  Passes
    Multi-pass Compilation
    One-pass Compilation
  Compiler Design Issues
  Case Study: The Triangle Compiler

Phases
  Syntactic Analysis: The source program is parsed to check whether it conforms to the source language’s syntax, and to determine its phrase structure
  Contextual Analysis: The parsed program is analyzed to check whether it conforms to the source language's contextual constraints
  Code Generation: The checked program is translated to an object program, in accordance with the semantics of the source and target languages

Phases
  Source Program → Syntactic Analysis → AST → Contextual Analysis → Decorated AST → Code Generation → Object Program
  Syntactic Analysis and Contextual Analysis each feed an Error Report

Syntactic Analysis To determine the source program’s phrase structure
Parsing
  Contextual analysis and code generation must know how the program is composed
    Commands, expressions, declarations, …
  Check for conformance to the source language’s syntax
  Construct suitable representation of its phrase structure (AST)
AST
  Terminal nodes corresponding to identifiers, literals, and operators
  Sub trees representing the phrases of the source program
  Blanks and comments not in AST (no meaning)
  Punctuation and brackets not in AST (only separate and enclose)

Contextual Analysis
Analyzes the parsed program
  Scope rules
  Type rules
Produces decorated AST
  AST with information gathered during contextual analysis
  Each applied occurrence of an identifier is linked to the corresponding declaration
  Each expression is decorated by its type T

Code Generation The final translation of the checked program to an object program After syntactic and contextual analysis is completed Treatment of identifiers Constants Binds identifier to value Replace each occurrence of identifier with value Variables Binds identifier to some memory address Replace each occurrence of identifier by address Target language Assembly language Machine code

Passes
Multi-pass compilation
  Traverses the program or AST several times
  The Compiler Driver calls the Syntactic Analyzer, the Contextual Analyzer, and the Code Generator in turn
One-pass compilation
  Single traverse of program
  Contextual analysis and code generation are performed ‘on the fly’ during syntactic analysis
  The Compiler Driver calls the Syntactic Analyzer, which in turn calls the Contextual Analyzer and the Code Generator

Compiler Design Issues
Speed Compiler run time Space Storage: size of compiler + files generated Modularity Multi-pass compiler more modular than one-pass compiler Flexibility Multi-pass compiler is more flexible because it generates an AST that can be traversed in any order by the other phases Semantics-preserving transformations To optimize code – must have multi-pass compiler Source language properties May restrict compiler choice – some language constructs may require multi-pass compilers

Chapter 4: Syntactic Analysis
Sub-phases of Syntactic Analysis Grammars Revisited Parsing Abstract Syntax Trees Scanning Case Study: Syntactic Analysis in the Triangle Compiler

Structure of a Compiler
Lexical Analyzer Source code tokens Symbol Table Parser & Semantic Analyzer parse tree Intermediate Code Generation intermediate representation Optimization intermediate representation Assembly Code Generation Assembly code

Syntactic Analysis Main function
Parse source program to discover its phrase structure Recursive-descent parsing Constructing an AST Scanning to group characters into tokens

Sub-phases of Syntactic Analysis
Scanning (or lexical analysis)
  Source program transformed to a stream of tokens
    Identifiers
    Literals
    Operators
    Keywords
    Punctuation
  Comments and blank spaces discarded
Parsing
  To determine the source program's phrase structure
  Source program is input as a stream of tokens (from the Scanner)
  Treats each token as a terminal symbol
  Representation of phrase structure: AST

Lexical Analysis – A Simple Example
Scan the file character by character and group characters into words and punctuation (tokens), remove white space and comments
Some tokens for this example: main ( ) { int a , b , c ;
    main()
    {
        int a, b, c;
        char number[5];
        /* get user inputs */
        a = atoi(gets(number));
        b = atoi(gets(number));
        /* calculate value for c */
        c = 2*(a+b) + a*(a+b);
        /* print results */
        printf("%d", c);
    }

Creating Tokens – Mini-Triangle Example
Source program:
    let var y: Integer in !new year
    y := y+1
Buffer → Input Converter → character string (S = space): l e t S v a r S y : S I n t e g e r S i n …
Scanner → tokens:
    Kind:     let  var  Ident.  colon  Ident.   in  Ident.  becomes  Ident.  op.  Intlit.  eot
    Spelling: let  var  y       :      Integer  in  y       :=       y       +    1

Tokens in Triangle
    Kind          Value   Spelling
    // literals, identifiers, operators...
    INTLITERAL    0       "<int>"
    CHARLITERAL   1       "<char>"
    IDENTIFIER    2       "<identifier>"
    OPERATOR      3       "<operator>"
    // reserved words - must be in alphabetical order...
    ARRAY         4       "array"
    BEGIN         5       "begin"
    CONST         6       "const"
    DO            7       "do"
    ELSE          8       "else"
    END           9       "end"
    FUNC          10      "func"
    IF            11      "if"
    IN            12      "in"
    LET           13      "let"
    OF            14      "of"
    PROC          15      "proc"
    RECORD        16      "record"
    THEN          17      "then"
    TYPE          18      "type"
    VAR           19      "var"
    WHILE         20      "while"
    // punctuation...
    DOT           21      "."
    COLON         22      ":"
    SEMICOLON     23      ";"
    COMMA         24      ","
    BECOMES       25      ":="
    IS            26      "~"
    // brackets...
    LPAREN        27      "("
    RPAREN        28      ")"
    LBRACKET      29      "["
    RBRACKET      30      "]"
    LCURLY        31      "{"
    RCURLY        32      "}"
    // special tokens...
    EOT           33      ""
    ERROR         34      "<error>"

Grammars Revisited Context free grammars
Generates a set of sentences Each sentence is a string of terminal symbols An unambiguous sentence has a unique phrase structure embodied in its syntax tree Develop parsers from context-free grammars

Regular Expressions
A regular expression (RE) is a convenient notation for expressing a set of strings of terminal symbols
Main features
  ‘|’ separates alternatives
  ‘*’ indicates that the previous item may be repeated zero or more times
  ‘(‘ and ‘)’ are grouping parentheses

Regular Expression Basics
ε  The empty string, a special string of length 0
Regular expression operations
  |  separates alternatives
  *  indicates that the previous item may be repeated zero or more times (repetition)
  ( and )  are grouping parentheses

Algebraic Properties
  | is commutative and associative
    r|s = s|r
    r|(s|t) = (r|s)|t
  Concatenation is associative
    (rs)t = r(st)
  Concatenation distributes over |
    r(s|t) = rs|rt
    (s|t)r = sr|tr
  ε is the identity for concatenation
    εr = r
    rε = r
  * is idempotent
    r** = r*
    r* = (r|ε)*

Common Extensions
  r+    one or more of expression r, same as rr*
  r^k   k repetitions of r; for example r^3 = rrr
  ~r    the characters not in the expression r; for example ~[\t\n]
  r-z   range of characters; for example [0-9a-z]
  r?    zero or one copy of expression r (used for fields of an expression that are optional)

Regular Expression Example
Regular Expression for Representing Months
  Examples of legal inputs
    January represented as 1 or 01
    October represented as 10
  First Try: [0|1|ε][0-9]
    Matches all legal inputs? Yes: 1, 2, 3, …, 10, 11, 12, 01, 02, …, 09
    Matches any illegal inputs? Yes: 0, 00, 18

Regular Expression for Representing Months Examples of legal inputs January represented as 1 or 01 October represented as 10 Second Try: [1-9]|(0[1-9])|(1[0-2]) Matches all legal inputs? Yes 1, 2, 3, …, 10, 11, 12, 01, 02, …, 09 Matches any illegal inputs? No
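Both attempts can be checked with java.util.regex. In this sketch the first try is transcribed as [01]?[0-9], which is an assumed translation of the slide notation [0|1|ε][0-9] into Java regex syntax; the second try is used as written. The class name MonthRegex is hypothetical.

```java
import java.util.regex.Pattern;

/** Checks the two month expressions against whole input strings. */
final class MonthRegex {
    // Assumed Java transcription of the first try: an optional 0 or 1, then any digit.
    static final Pattern FIRST_TRY = Pattern.compile("[01]?[0-9]");
    // Second try, as written on the slide.
    static final Pattern SECOND_TRY = Pattern.compile("[1-9]|(0[1-9])|(1[0-2])");

    /** True if the whole string s (not just a prefix) matches pattern p. */
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"1", "01", "10", "12", "0", "00", "18"};
        for (String s : inputs) {
            System.out.println(s + ": first=" + matches(FIRST_TRY, s)
                    + " second=" + matches(SECOND_TRY, s));
        }
    }
}
```

Matcher.matches() requires the entire input to match, which mirrors the question the slides ask: is this string, as a whole, a legal month?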

Regular Expression for Floating Point Numbers
  Examples of legal inputs: 1.0, 0.2, -1.0, 2.7e8, 1.0E-6
  Assume that a 0 is required before numbers less than 1; the expression does not prevent extra leading zeros, so numbers such as 0011 are legal
  Building the regular expression
    Assume digit → 0|1|2|3|4|5|6|7|8|9
    Handle simple decimals such as 1.0, 0.2: digit+.digit+
    Add an optional sign (only minus, no plus): (-|ε)digit+.digit+ or -?digit+.digit+
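The expression built so far, -?digit+.digit+, can be tried out in Java. This is a sketch of that intermediate step only: it does not yet handle exponents such as 2.7e8, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Intermediate step of the floating point expression: optional minus, digits, a dot, digits. */
final class FloatRegex {
    // The dot is escaped because an unescaped '.' matches any character in java.util.regex.
    private static final Pattern SIMPLE_DECIMAL = Pattern.compile("-?[0-9]+\\.[0-9]+");

    static boolean isSimpleDecimal(String s) {
        return SIMPLE_DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.0", "0.2", "-1.0", "0011.5", ".5", "+1.0", "2.7e8"}) {
            System.out.println(s + ": " + isSimpleDecimal(s));
        }
    }
}
```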

Extended BNF Extended BNF (EBNF) Combination of BNF and RE
N::=X, where N is a nonterminal symbol and X is an extended RE, i.e., an RE constructed from both terminal and nonterminal symbols
EBNF
  Right hand side may use |, *, (, )
  Right hand side may contain both terminal and nonterminal symbols

Example EBNF
  Expression ::= primary-Expression (Operator primary-Expression)*
  primary-Expression ::= Identifier | ( Expression )
  Identifier ::= a|b|c|d|e
  Operator ::= +|-|*|/
Generates
  e
  a + b
  a – b – c
  a + (b * c)
  a + (b + c) / d
  a – (b – (c – (d – e)))

Grammar Transformations
Left Factorization
  XY | XZ is equivalent to X(Y | Z)
  Before:
    single-Command ::= V-name := Expression
                     | if Expression then single-Command
                     | if Expression then single-Command else single-Command
  After:
    single-Command ::= V-name := Expression
                     | if Expression then single-Command (ε | else single-Command)

Substitution of nonterminal symbols
  Given N::=X, we can substitute each occurrence of N with X iff N::=X is nonrecursive and is the only production rule for N
  Before:
    single-Command ::= for Control-Variable := Expression To-or-Downto Expression do single-Command | …
    Control-Variable ::= Identifier
    To-or-Downto ::= to | downto
  After:
    single-Command ::= for Identifier := Expression (to|downto) Expression do single-Command | …

Scanning (Lexical Analysis)
The purpose of scanning is to recognize tokens in the source program. Or, to group input characters (the source program text) into tokens. Difference between parsing and scanning: Parsing groups terminal symbols, which are tokens, into larger phrases such as expressions and commands and analyzes the tokens for correctness and structure Scanning groups individual characters into tokens
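A minimal scanner sketch for the Mini-Triangle fragment used in these slides shows how characters are grouped into tokens. It is not the Triangle compiler's scanner: the class name MiniScanner, the string form of the tokens, and the small keyword and operator sets are assumptions made for illustration. It treats '!' as the start of a comment that runs to the end of the line.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups characters into tokens; white space and comments are discarded. */
final class MiniScanner {
    static List<String> scan(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                            // skip blanks, tabs, new lines
            } else if (c == '!') {                              // comment: skip to end of line
                while (i < source.length() && source.charAt(i) != '\n') i++;
            } else if (Character.isLetter(c)) {                 // identifier or keyword
                int start = i;
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) i++;
                String word = source.substring(start, i);
                boolean keyword = word.equals("let") || word.equals("var") || word.equals("in");
                tokens.add(keyword ? word : "Ident(" + word + ")");
            } else if (Character.isDigit(c)) {                  // integer literal
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
                tokens.add("Intlit(" + source.substring(start, i) + ")");
            } else if (c == ':') {                              // ':' or ':=' needs one character of look-ahead
                if (i + 1 < source.length() && source.charAt(i + 1) == '=') {
                    tokens.add("becomes");
                    i += 2;
                } else {
                    tokens.add("colon");
                    i++;
                }
            } else if ("+-*/<>=".indexOf(c) >= 0) {             // single-character operators only
                tokens.add("op(" + c + ")");
                i++;
            } else {
                tokens.add("error(" + c + ")");                 // the parser would report this
                i++;
            }
        }
        tokens.add("eot");                                      // end of text
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("let var y: Integer in !new year\ny := y+1"));
    }
}
```

The parser then sees only the token list; it never looks at individual characters, blanks, or comments.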

What Does a Scanner Do? Handle keywords (reserved words)
Recognizes identifiers and keywords
  Match explicitly
    Write regular expression for each keyword
    Identifier is any alphanumeric string which is not a keyword
  Match as an identifier, perform lookup
    No special regular expressions for keywords
    When an identifier is found, perform lookup into preloaded keyword table
How does Triangle handle keywords? Discuss in terms of efficiency and ease of coding.
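The second approach (match as an identifier, then look it up) can be sketched as follows. The keyword list is the reserved-word list from the Triangle token table; keeping it in alphabetical order allows a binary search. Whether the Triangle compiler performs its lookup this way is left as the discussion question above; the class and method names here are hypothetical.

```java
import java.util.Arrays;

/** Classifies a scanned word by looking it up in a preloaded, sorted keyword table. */
final class KeywordLookup {
    // Must stay in alphabetical order for Arrays.binarySearch to work.
    private static final String[] KEYWORDS = {
        "array", "begin", "const", "do", "else", "end", "func", "if", "in",
        "let", "of", "proc", "record", "then", "type", "var", "while"
    };

    static boolean isKeyword(String spelling) {
        return Arrays.binarySearch(KEYWORDS, spelling) >= 0;
    }

    static String classify(String spelling) {
        return isKeyword(spelling) ? "keyword" : "identifier";
    }

    public static void main(String[] args) {
        for (String word : new String[] {"while", "whilst", "in", "Integer"}) {
            System.out.println(word + " -> " + classify(word));
        }
    }
}
```

The scanner itself needs only one rule for words; adding a keyword means adding one table entry rather than one more regular expression.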

What Does a Scanner Do? Remove white space and comments
Remove white space
  Tabs, spaces, new lines
Remove comments
  Single line: -- Ada comment
  Multi-line, start and end delimiters: { Pascal comment } /* C comment */
  Nested
  Runaway comments: nonterminated comments can’t be detected till end of file

What Does a Scanner Do? Perform look ahead Challenging input languages
Multi-character tokens 1..10 vs. 1.10 &, && <, <= etc Challenging input languages FORTRAN Keywords not reserved Blanks are not a delimiter Example (comma vs. decimal) DO10I=1,5 start of a do loop (equivalent to a C for loop) DO10I=1.5 an assignment statement, assignment to variable DO10I
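The 1..10 versus 1.10 case can be sketched in Java. After reading the digits of a number, the scanner must look past the '.' to the following character before deciding whether the dot belongs to the number. The token names (int, real, dotdot, dot) and the class name are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Distinguishes 1..10 (int, range, int) from 1.10 (one real literal) by looking ahead. */
final class LookaheadDemo {
    static List<String> scan(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                int start = i;
                while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                // Look ahead: a '.' continues the number only if a digit follows it.
                if (i + 1 < s.length() && s.charAt(i) == '.' && Character.isDigit(s.charAt(i + 1))) {
                    i++;                                        // consume the decimal point
                    while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
                    out.add("real(" + s.substring(start, i) + ")");
                } else {
                    out.add("int(" + s.substring(start, i) + ")");
                }
            } else if (s.startsWith("..", i)) {                 // two-character token
                out.add("dotdot");
                i += 2;
            } else if (c == '.') {
                out.add("dot");
                i++;
            } else if (Character.isWhitespace(c)) {
                i++;
            } else {
                out.add("error(" + c + ")");
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("1..10"));
        System.out.println(scan("1.10"));
    }
}
```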

What Does a Scanner Do? Challenging input languages (cont.)
PL/I, keywords not reserved IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;

What Does a Scanner Do? Error Handling
Error token passed to parser, which reports the error
Recovery
  Delete the characters of the current token which have been read so far, and restart scanning at the next unread character
  Delete the first character of the current lexeme and resume scanning from the next character
Examples of lexical errors:
  3.25e   bad format for a constant
  Var#1   illegal character
Some errors that are not lexical errors
  Mistyped keywords: Begim
  Mismatched parentheses
  Undeclared variables

Scanner Implementation
Issues Simpler design – parser doesn’t have to worry about white space, etc. Improve compiler efficiency – allows the construction of a specialized and potentially more efficient processor Compiler portability is enhanced – input alphabet peculiarities and other device-specific anomalies can be restricted to the scanner

Scanner Implementation
What are the keywords in Triangle?
How are keywords and identifiers implemented in Triangle?
Is look ahead implemented in Triangle? If so, how?

Parsing Given an unambiguous, context free grammar, parsing is
Recognition of an input string, i.e., deciding whether or not the input string is a sentence of the grammar Parsing of an input string, i.e., recognition of the input string plus determination of its phrase structure. The phrase structure can be represented by a syntax tree, or otherwise. Unambiguous is necessary so that every sentence of the grammar will form exactly one syntax tree.

Parsing
The syntax of programming language constructs is described by context-free grammars.
Advantages of unambiguous, context-free grammars
  A precise, yet easy-to-understand, syntactic specification of the programming language
  For certain classes of grammars we can automatically construct an efficient parser that determines if a source program is syntactically well formed
  Imparts a structure to a programming language that is useful for the translation of source programs into correct object code and for the detection of errors
  Easier to add new constructs to the language if the implementation is based on a grammatical description of the language

Parsing sequence of tokens parser syntax tree Check the syntax (structure) of a program and create a tree representation of the program Programming languages have non-regular constructs Nesting Recursion Context-free grammars are used to express the syntax for programming languages

Context-Free Grammars
Comprised of
  A set of tokens or terminal symbols
  A set of non-terminal symbols
  A set of rules or productions which express the legal relationships between symbols
  A start or goal symbol
Example:
  expr → expr - digit
  expr → expr + digit
  expr → digit
  digit → 0|1|2|…|9
  Tokens: -, +, 0, 1, 2, …, 9
  Non-terminals: expr, digit
  Start symbol: expr

Context-Free Grammars
  expr → expr - digit
  expr → expr + digit
  expr → digit
  digit → 0|1|2|…|9
  Example input: 3 + 8 - 2
  [Figure: syntax tree for the example input, with expr and digit nodes above the leaves 3, +, 8, -, 2]

Checking for Correct Syntax
Given a grammar for a language and a program, how do you know if the syntax of the program is legal? A legal program can be derived from the start symbol of the grammar Grammar must be unambiguous and context-free

Deriving a String
  The derivation begins with the start symbol
  At each step of a derivation the right hand side of a grammar rule is used to replace a non-terminal symbol
  Continue replacing non-terminals until only terminal symbols remain
  Grammar
    Rule 1: expr → expr - digit
    Rule 2: expr → expr + digit
    Rule 3: expr → digit
    Rule 4: digit → 0|1|2|…|9
  Example input: 3 + 8 - 2
    expr ⇒ expr - digit        (Rule 1)
         ⇒ expr - 2            (Rule 4)
         ⇒ expr + digit - 2    (Rule 2)
         ⇒ expr + 8 - 2        (Rule 4)
         ⇒ digit + 8 - 2       (Rule 3)
         ⇒ 3 + 8 - 2           (Rule 4)

Rightmost Derivation
  The rightmost non-terminal is replaced in each step
  Grammar
    Rule 1: expr → expr - digit
    Rule 2: expr → expr + digit
    Rule 3: expr → digit
    Rule 4: digit → 0|1|2|…|9
  Example input: 3 + 8 - 2
    Rule 1: expr ⇒ expr - digit
    Rule 4: expr - digit ⇒ expr - 2
    Rule 2: expr - 2 ⇒ expr + digit - 2
    Rule 4: expr + digit - 2 ⇒ expr + 8 - 2
    Rule 3: expr + 8 - 2 ⇒ digit + 8 - 2
    Rule 4: digit + 8 - 2 ⇒ 3 + 8 - 2

Leftmost Derivation
  The leftmost non-terminal is replaced in each step
  Grammar
    Rule 1: expr → expr - digit
    Rule 2: expr → expr + digit
    Rule 3: expr → digit
    Rule 4: digit → 0|1|2|…|9
  Example input: 3 + 8 - 2
    Rule 1: expr ⇒ expr - digit
    Rule 2: expr - digit ⇒ expr + digit - digit
    Rule 3: expr + digit - digit ⇒ digit + digit - digit
    Rule 4: digit + digit - digit ⇒ 3 + digit - digit
    Rule 4: 3 + digit - digit ⇒ 3 + 8 - digit
    Rule 4: 3 + 8 - digit ⇒ 3 + 8 - 2

Leftmost Derivation
  The leftmost non-terminal is replaced in each step
  [Figure: syntax tree for 3 + 8 - 2, with nodes numbered 1 to 6 in the order in which the derivation steps create them]
    Step 1 (Rule 1): expr ⇒ expr - digit
    Step 2 (Rule 2): expr - digit ⇒ expr + digit - digit
    Step 3 (Rule 3): expr + digit - digit ⇒ digit + digit - digit
    Step 4 (Rule 4): digit + digit - digit ⇒ 3 + digit - digit
    Step 5 (Rule 4): 3 + digit - digit ⇒ 3 + 8 - digit
    Step 6 (Rule 4): 3 + 8 - digit ⇒ 3 + 8 - 2

Bottom-Up Parsing Parser examines terminal symbols of the input string, in order from left to right Reconstructs the syntax tree from the bottom (terminal nodes) up (toward the root node) Bottom-up parsing reduces a string w to the start symbol of the grammar. At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse.

Bottom-Up Parsing Types of bottom-up parsing algorithms
Shift-reduce parsing At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse. LR(k) parsing L is for left-to-right scanning of the input, the R is for constructing a right-most derivation in reverse, and the k is for the number of input symbols of look-ahead that are used in making parsing decisions.

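Bottom-Up Parsing Example 3+8-2
  expr → expr - digit
  expr → expr + digit
  expr → digit
  digit → 0|1|2|…|9
  Example input: 3 + 8 - 2
  Reductions (each step replaces the right side of a production by its left side, building the tree from the leaves up):
    3 + 8 - 2
    digit + 8 - 2        (digit → 3)
    expr + 8 - 2         (expr → digit)
    expr + digit - 2     (digit → 8)
    expr - 2             (expr → expr + digit)
    expr - digit         (digit → 2)
    expr                 (expr → expr - digit)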
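The reductions for 3 + 8 - 2 can be traced with a small shift-reduce sketch in Java. It is hard-wired to this one grammar (expr → expr - digit | expr + digit | digit), so the reduce decisions are simple stack checks rather than a general LR table; the class name and the trace format are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Shift-reduce recognizer for: expr → expr - digit | expr + digit | digit. */
final class ShiftReduceDemo {
    /** Returns the stack contents after every shift and every reduce. */
    static List<String> parse(String input) {
        List<String> stack = new ArrayList<>();
        List<String> trace = new ArrayList<>();
        for (char c : input.replace(" ", "").toCharArray()) {
            stack.add(String.valueOf(c));                       // shift the next terminal
            trace.add(String.join(" ", stack));
            if (Character.isDigit(c)) {
                stack.set(stack.size() - 1, "digit");           // reduce: digit → 0|1|…|9
                trace.add(String.join(" ", stack));
                int n = stack.size();
                if (n == 1) {
                    stack.set(0, "expr");                       // reduce: expr → digit
                } else if (n == 3 && stack.get(0).equals("expr")
                        && (stack.get(1).equals("+") || stack.get(1).equals("-"))) {
                    stack.clear();                              // reduce: expr → expr + digit
                    stack.add("expr");                          //      or expr → expr - digit
                } else {
                    throw new IllegalArgumentException("syntax error at '" + c + "'");
                }
                trace.add(String.join(" ", stack));
            } else if (c != '+' && c != '-') {
                throw new IllegalArgumentException("illegal character '" + c + "'");
            }
        }
        if (stack.size() != 1 || !stack.get(0).equals("expr")) {
            throw new IllegalArgumentException("incomplete input");
        }
        return trace;
    }

    public static void main(String[] args) {
        for (String step : parse("3 + 8 - 2")) {
            System.out.println(step);
        }
    }
}
```

Reading the reduce steps of the trace from bottom to top gives a rightmost derivation of the input.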

Bottom-Up Parsing Example abbcde
  S → aABe
  A → Abc | b
  B → d
  Example input: abbcde
  Reductions (the syntax tree grows upward from the leaves a b b c d e):
    abbcde reduces to aAbcde    (A → b)
    aAbcde reduces to aAde      (A → Abc)
    aAde reduces to aABe        (B → d)
    aABe reduces to S           (S → aABe)

Top-Down Parsing The parser examines the terminal symbols of the input string, in order from left to right. The parser reconstructs its syntax tree from the top (root node) down (towards the terminal nodes). An attempt to find the leftmost derivation for an input string

Top-Down Parsers General rules for top-down parsers
Start with just a stub for the root node At each step the parser takes the left most stub If the stub is labeled by terminal symbol t, the parser connects it to the next input terminal symbol, which must be t. (If not, the parser has detected a syntactic error.) If the stub is labeled by nonterminal symbol N, the parser chooses one of the production rules N::= X1…Xn, and grows branches from the node labeled by N to new stubs labeled X1,…, Xn (in order from left to right). Parsing succeeds when and if the whole input string is connected up to the syntax tree.

Top-Down Parsing Two forms
Backtracking parsers
  Guess which rule to apply, then back up and change the choice if they cannot proceed
Predictive parsers
  Predict which rule to apply by using look-ahead tokens
Backtracking parsers are not very efficient. We will cover predictive parsers.

Predictive Parsers Many types
LL(1) parsing
  First L is for scanning the input from left to right; second L is for producing a left-most derivation; 1 is for using one input symbol of look-ahead
  Table driven with an explicit stack to maintain the parse tree
Recursive descent parsing
  Uses recursive subroutines to traverse the parse tree

Predictive Parsers (Lookahead)
Lookahead in predictive parsing
  The lookahead token (next token in the input) is used to determine which rule should be used next
  For example:
    term → num term’
    term’ → ‘+’ num term’ | ‘-’ num term’ | ε
    num → ‘0’|’1’|’2’|…|’9’
  Example input: 7 + 3 - 2
  [Figure: the syntax tree grown step by step, one lookahead token at a time]
    Start: term → num term’
    Lookahead 7: num → 7
    Lookahead +: term’ → + num term’
    Lookahead 3: num → 3
    Lookahead -: term’ → - num term’
    Lookahead 2: num → 2
    End of input: term’ → ε
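A predictive parser for the term grammar can be sketched in Java with one method per nonterminal, each choosing its rule from a single lookahead character. As an addition that is not on the slides, the sketch also computes the value of the input so the result is easy to check. The class name and the '\0' end marker are assumptions.

```java
/** Predictive parser for: term → num term' ; term' → '+' num term' | '-' num term' | ε ; num → '0'..'9'. */
final class TermParser {
    private final String input;
    private int pos = 0;

    private TermParser(String input) {
        this.input = input.replace(" ", "");
    }

    /** The lookahead token: the next unread character, or '\0' at end of input. */
    private char lookahead() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    // term → num term'
    private int parseTerm() {
        int value = parseNum();
        return parseTermPrime(value);
    }

    // term' → '+' num term' | '-' num term' | ε, chosen by the lookahead
    private int parseTermPrime(int left) {
        switch (lookahead()) {
            case '+': pos++; return parseTermPrime(left + parseNum());
            case '-': pos++; return parseTermPrime(left - parseNum());
            default:  return left;                              // ε: consume nothing
        }
    }

    // num → '0'|'1'|…|'9'
    private int parseNum() {
        char c = lookahead();
        if (!Character.isDigit(c)) {
            throw new IllegalArgumentException("digit expected at position " + pos);
        }
        pos++;
        return c - '0';
    }

    /** Parses the whole input and returns its value; throws on a syntax error. */
    static int evaluate(String source) {
        TermParser parser = new TermParser(source);
        int value = parser.parseTerm();
        if (parser.pos != parser.input.length()) {
            throw new IllegalArgumentException("unexpected input at position " + parser.pos);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("7 + 3 - 2"));
    }
}
```

No backtracking is needed: in term' the three alternatives start with different tokens, so one token of lookahead always selects the rule.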

Recursive-Descent Parsing
Top-down parsing algorithm Consists of a group of methods (programs) parseN, one for each nonterminal symbol N of the grammar. The task of each method parseN is to parse a single N-phrase These parsing methods cooperate to parse complete sentences

Sentence → Subject Verb Object .
Subject → I | a Noun | the Noun
Object → me | a Noun | the Noun
Noun → cat | mat | rat
Verb → like | is | see | sees

parseNoun (for Noun → cat | mat | rat):
    if input = “cat” accept
    else if input = “mat” accept
    else if input = “rat” accept
    else error

Systematic Development of a Recursive-Descent Parser
Given a (suitable) context-free grammar
  1. Express the grammar in EBNF, with a single production rule for each nonterminal symbol, and perform any necessary grammar transformations
       Always eliminate left recursion
       Always left-factorize whenever possible
  2. Transcribe each EBNF production rule N::=X to a parsing method parseN, whose body is determined by X
  3. Make the parser consist of:
       A private variable currentToken
       Private parsing methods developed in the previous step
       Private auxiliary methods accept and acceptIt, both of which call the scanner
       A public parse method that calls parseS (where S is the start symbol of the grammar), having first called the scanner to store the first input token in currentToken

Quote of the Week “C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do, it blows away your whole leg.” Bjarne Stroustrup

Quote of the Week Did you really say that?
Dr. Bjarne Stroustrup: Yes, I did say something along the lines of C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do, it blows your whole leg off. What people tend to miss is that what I said about C++ is to a varying extent true for all powerful languages. As you protect people from simple dangers, they get themselves into new and less obvious problems. Someone who avoids the simple problems may simply be heading for a not-so-simple one. One problem with very supporting and protective environments is that the hard problems may be discovered too late or be too hard to remedy once discovered. Also, a rare problem is harder to find than a frequent one because you don't suspect it. I also said, "Within C++, there is a much smaller and cleaner language struggling to get out." For example, that quote can be found on page 207 of The Design and Evolution of C++. And no, that smaller and cleaner language is not Java or C#. The quote occurs in a section entitled "Beyond Files and Syntax". I was pointing out that the C++ semantics is much cleaner than its syntax. I was thinking of programming styles, libraries and programming environments that emphasized the cleaner and more effective practices over archaic uses focused on the low-level aspects of C.

Converting EBNF Production Rules to Parsing Methods
For production rule N ::= X Convert production rule to parsing method named parseN:
private void parseN () { parse X }
Refine parse ε to a dummy statement
Refine parse t (where t is a terminal symbol) to accept(t) or acceptIt()
Refine parse N (where N is a nonterminal symbol) to a call of the corresponding parsing method parseN()
Refine parse X Y to { parse X parse Y }
Refine parse X | Y to switch (currentToken.kind) { cases in starters[[X]]: parse X break; cases in starters[[Y]]: parse Y break; default: report a syntax error }

For X | Y Choose parse X only if the current token is one that can start an X-phrase Choose parse Y only if the current token is one that can start a Y-phrase starters[[X]] and starters[[Y]] must be disjoint For X* Refine parse X* to while (currentToken.kind is in starters[[X]]) parse X starters[[X]] must be disjoint from the set of tokens that can follow X* in this particular context

A grammar that satisfies both these conditions is called an LL(1) grammar Recursive-descent parsing is suitable only for LL(1) grammars
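
The first LL(1) condition (disjoint starter sets for the alternatives of a rule) is easy to check mechanically once the starter sets are known. The sketch below assumes the starter sets have already been computed and are handed in as sets of token spellings; the class name LL1Check and the method name are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: checks that the starter sets of the alternatives
// X | Y | ... of one production rule are pairwise disjoint.
class LL1Check {
    static boolean pairwiseDisjoint(List<Set<String>> starterSets) {
        for (int i = 0; i < starterSets.size(); i++) {
            for (int j = i + 1; j < starterSets.size(); j++) {
                // Collections.disjoint is true when the two sets share no element.
                if (!Collections.disjoint(starterSets.get(i), starterSets.get(j))) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

For a rule such as Subject → I | a Noun | the Noun the starter sets {I}, {a}, {the} pass the check; two alternatives that both start with "a" would fail it, which is the situation left factorization is meant to remove.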

Error Repair Good programming languages are designed with a relatively large “distance” between syntactically correct programs, to increase the likelihood that conceptual mistakes are caught as syntactic errors. Error repair usually occurs at two levels: Local: repairs mistakes with little global import, such as missing semicolons and undeclared variables. Scope: repairs the program text so that scopes are correct. Errors of this kind include unbalanced parentheses and begin/end blocks.

Error Repair Repair actions can be divided into insertions and deletions. Typically the compiler will use some look ahead and backtracking in attempting to make progress in the parse. There is great variation among compilers, though some languages (PL/C) carry a tradition of good error repair. Goals of error repair are: No input should cause the compiler to collapse Illegal constructs are flagged Frequently occurring errors are repaired gracefully Minimal stuttering or cascading of errors. LL-Style parsing lends itself well to error repair, since the compiler uses the grammar’s rules to predict what should occur next in the input

Mini-Triangle Production Rules
Program ::= Command    Program (1.14)
Command ::= V-name := Expression    AssignCommand (1.15a)
  | Identifier ( Expression )    CallCommand (1.15b)
  | Command ; Command    SequentialCommand (1.15c)
  | if Expression then Command else Command    IfCommand (1.15d)
  | while Expression do Command    WhileCommand (1.15e)
  | let Declaration in Command    LetCommand (1.15f)
Expression ::= Integer-Literal    IntegerExpression (1.16a)
  | V-name    VnameExpression (1.16b)
  | Operator Expression    UnaryExpression (1.16c)
  | Expression Operator Expression    BinaryExpression (1.16d)
V-name ::= Identifier    SimpleVname (1.17)
Declaration ::= const Identifier ~ Expression    ConstDeclaration (1.18a)
  | var Identifier : Type-denoter    VarDeclaration (1.18b)
  | Declaration ; Declaration    SequentialDeclaration (1.18c)
Type-denoter ::= Identifier    SimpleTypeDenoter (1.19)

Abstract Syntax Trees An explicit representation of the source program’s phrase structure AST for Mini-Triangle

Abstract Syntax Trees Program ASTs (P):
Program with subtree C (1.14)
Command ASTs (C):
AssignCommand with subtrees V, E (1.15a)
CallCommand with subtrees Identifier (carrying its spelling), E (1.15b)
SequentialCommand with subtrees C1, C2 (1.15c)

Abstract Syntax Trees Command ASTs (C):
IfCommand with subtrees E, C1, C2 (1.15d)
WhileCommand with subtrees E, C (1.15e)
LetCommand with subtrees D, C (1.15f)

Midterm Review: Chapter 1
Context-free Grammar A finite set of terminal symbols A finite set of non-terminal symbols A start symbol A finite set of production rules Aspects of a programming language that need to be specified Syntax: form of programs Contextual constraints: scope rules and type rules Semantics: meaning of programs

Language specification Informal: written in English Formal: precise notation (BNF, EBNF) Unambiguous Consistent Complete Context-free language Syntax tree Phrase Sentence

Syntax tree Terminal nodes labeled by terminal symbols Non-terminal nodes labeled by non-terminal symbols Abstract Syntax Tree (AST) Each non-terminal node is labeled by a production rule Each non-terminal node has exactly one subtree for each subphrase Does not generate sentences

Translator Accepts any text expressed in one language (source language) and generates a semantically-equivalent text expressed in another language (target language) Compiler Translates from high-level language into low-level language Interpreter A program that accepts any program (source program) expressed in a particular language (source language) and runs that source program immediately

Interpretive compiler Combination of compiler and interpreter Some of the advantages of each Portable compiler Compiled and run on any machine, without change Portability measured by proportion of code that remains unchanged Portability is an economic issue Bootstrapping Using the language processor to process itself Tombstone diagrams

Three phases of compilation Syntactic analysis Contextual analysis Code generation Single pass compilers Multi-pass compilers Compiler design issues Speed Space Modularity Flexibility Semantic preserving transformations Source language properties

Sub-phases of syntactic analysis Scanning (lexical analysis) Source program transformed to stream of tokens Comments and blank spaces between tokens are discarded Parsing Source program in form of stream of tokens parsed to determine phrase structure Parser treats each token as a terminal symbol Representation of the phrase structure A data structure representing the source program’s phrase structure Typically an abstract syntax tree (AST)

Tokens An atomic symbol of the source program May consist of several characters Classified according to kind All tokens of the same kind can be freely interchanged without affecting the program’s phrase structure Each token completely described by its kind and spelling Token represented by tuple (kind, spelling) Only kind of each token examined by parser Spelling examined by contextual analyzer and/or code generator

Grammars Regular expressions “|” separates alternatives “*” indicates that the previous item may be repeated zero or more times “(” and “)” are grouping parentheses ε is the empty string, a special string of length 0 Algebraic properties Common extensions Grammar transformations Left factorization Elimination of left recursion Substitution of non-terminal symbols

Structure of compiler Source code Lexical analyzer Parser & semantic analyzer Intermediate code generation Optimization Assembly code generation Assembly code

Scanning (lexical analysis) What does it do? Handles keywords (reserved words) Removes white space (tabs, spaces, new lines) Removes comments Performs look ahead Error handling Issues Simpler design Improve compiler efficiency Enhance compiler portability

Parsing Given an unambiguous, context-free grammar Recognition of input string – sentence in grammar Parsing an input string – determines its phrase structure Why is unambiguous important? Advantages of unambiguous, context-free grammars (see chart 81) How do you know the syntax of a language is legal? A legal program can be derived from the start symbol of the grammar

Parsing Rightmost (replace rightmost non-terminal in each step) and leftmost (replace leftmost non-terminal in each step) derivation Bottom-up (reconstructs syntax tree from terminal nodes up toward the root node) and top-down (reconstructs syntax tree from the root node down towards the terminal nodes) Predictive parsers LL(1) Recursive descent

Parsing Converting EBNF production rules to parsing methods Error repair

Chapter 5: Contextual Analysis
Identification Monolithic Block Structure Flat Block Structure Nested Block Structure Attributes Standard Environment Type Checking A Contextual Analysis Algorithm Case Study: Contextual Analysis in the Triangle Compiler

Contextual Analysis Given a parsed program, the purpose of contextual analysis is to check that the program conforms to the source language’s contextual constraints. Scope rules: rules governing declarations and applied occurrences of identifiers Type rules: rules that allow us to infer the types of expressions, and to decide whether each expression has a valid type Analysis of the program to determine correctness with respect to the language definition (beyond structure)

Contextual Analysis Contextual analysis consists of two sub-phases:
Identification: applying the source language’s scope rules to relate each applied occurrence of an identifier to its declaration (if any). Type checking: applying the source language's type rules to infer the type of each expression, and compare that type with the expected type.

[Figure: compiler structure. Source code → Lexical Analyzer → tokens → Parser → parse tree → Semantic Analyzer (Identification, Type checking) → Intermediate Code Generation → intermediate representation → Optimization → intermediate representation → Assembly Code Generation → Assembly code, with a Symbol Table alongside the front-end phases]

Identification Relate each applied occurrence of an identifier in the source program to the corresponding declaration Ill-formed program if no corresponding declaration – generate error Identification could cause compiler efficiency problems Inefficient to use the AST

Identification Table Also known as symbol table
Associates identifiers with their attributes Basic operation Make the identification table empty Add an entry associating a given identifier with a given attribute Retrieve the attribute (if any) associated with a given identifier Attribute Consists of information relevant to contextual analysis Obtained from the identifier’s declaration

Identification Table Each declaration in a program has a defined scope
Portion of program over which the declaration takes effect Block: any program phrase that delimits the scope of declarations within it Example Triangle block command let D in C Scope of each declaration in D extends over the subcommand C

Identification Table: Structure/Implementation
Maintain scope An identifier should be found in the table only when valid If an identifier is defined in multiple scopes, then a lookup in the table must provide the appropriate meaning for the use Efficiency How fast is lookup? How fast to enter/exit a scope? What is the overall table size?

Identification Table: Structure/Implementation
Different implementations Organized for efficient retrieval Binary search tree Hash table

Identification Table: Functionality
A mapping of identifiers to their meanings Information Name Type Location Operations Create Insert Lookup Delete Update entry Entering a new scope Leaving a scope

Block Structures
Monolithic block structure: Basic and Cobol
Flat block structure: Fortran
Nested block structure: Pascal, Ada, C, and Java

Monolithic Block Structure
The only block is the entire program All declarations are global Simple rules No identifier may be declared more than once For every applied occurrence of an identifier I, there must be a corresponding declaration of I No identifier may be used unless declared The identification table should contain entries for all declarations in the source program At most, one entry for each identifier The table contains an identifier I and the associated attribute A

Monolithic Block Structure
Example program:
program
(1) integer b = 10
(2) integer n
(3) char c
begin … n = n * b write c end
Identification table (identifier → attribute): b → (1), n → (2), c → (3)
Create new table: create command
At declaration for identifier I, make table entry: insert command
At applied occurrence of identifier I, retrieve information from table: lookup command
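
For a monolithic block structure the identification table can be a single map. The sketch below is illustrative only: the class name MonolithicIdTable is hypothetical, and attributes are represented as plain strings such as "(1)" standing for the declaration they came from.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an identification table for a monolithic block
// structure: one global scope, at most one entry per identifier.
class MonolithicIdTable {
    private final Map<String, String> entries = new HashMap<>(); // create: empty table

    // insert: called at each declaration.
    // Returns false if the identifier is already declared (a scope-rule error).
    boolean enter(String id, String attribute) {
        if (entries.containsKey(id)) return false;
        entries.put(id, attribute);
        return true;
    }

    // lookup: called at each applied occurrence.
    // Returns null if there is no corresponding declaration (also an error).
    String retrieve(String id) {
        return entries.get(id);
    }
}
```

The two error cases map directly onto the two rules for monolithic block structure: no identifier declared more than once, and no identifier used unless declared.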

Flat Block Structure Program partitioned into several disjoint blocks
Two scope levels Some declarations are local in scope Identifiers restricted to particular block Other declarations are global in scope Identifiers allowed anywhere in the program – the program as a whole is a block Less simple rules No globally declared identifier may be redeclared globally But same identifier may also be declared locally No locally declared identifier may be redeclared in the same block Same identifier may be declared locally in several different blocks For every applied occurrence of an identifier I in a block B, there must be a corresponding declaration of I Either global declaration of I or a declaration of I local to B Minor complication is to distinguish global and local declaration entries

Flat Block Structure Example program:
program
(1) procedure Q
    (2) real r
    (3) real pi = 3.14
    begin … end
(4) procedure R
    (5) integer c
    begin … end
(6) integer i
(7) boolean b
(8) char c
begin … call R … end
Identification table (identifier, attribute, level) while analyzing Q: Q (1) global, r (2) local, pi (3) local
While analyzing R: Q (1) global, R (4) global, c (5) local
While analyzing the main program: Q (1) global, R (4) global, i (6) local, b (7) local, c (8) local
Create new table: create command
At start of a block: enter new scope command
At end of a block: leave scope command, delete command
At declaration for identifier I, make table entry: insert command
At applied occurrence of identifier I, retrieve information from table: lookup command

Nested Block Structure
Blocks may be nested one within another Many scope levels Declarations in the outermost block are global in scope. The outermost block is at scope level 1 Declarations inside an inner block are local to that block Every inner block is completely enclosed by another block Next to outermost block is at scope level 2 If enclosed by a level-n, the block is at scope level n+1

More complex rules No identifier may be declared more than once in the same block Same identifier may be declared in different blocks, even if they are nested For every applied occurrence of an identifier I in a block B, there must be a corresponding declaration of I Must be in B itself Or in the block B’ immediately enclosing B Or in B’’ immediately enclosing B’ Etc. In smallest enclosing block that contains any declaration of I

Create new table: create command
At start of a block: enter new scope command
At end of a block: leave scope command, delete command
At declaration for identifier I, make table entry: insert command Level number determined by number of calls to enter new scope
At applied occurrence of identifier I, retrieve information from table using highest level for I: lookup command
Example program:
let (1) var a: Integer; (2) var b: Boolean
in begin
  …;
  let (3) var b: Integer; (4) var c: Boolean
  in begin
    …;
    let (5) var d: Integer in …;
    …
  end;
  …;
  let (6) var d: Boolean; (7) var e: Integer in …;
  …
end
Identification table (identifier, attribute, level) in the outer block: a (1) 1, b (2) 1
In the block declaring (3) and (4): a (1) 1, b (2) 1, b (3) 2, c (4) 2
In the block declaring (5): a (1) 1, b (2) 1, b (3) 2, c (4) 2, d (5) 3
In the block declaring (6) and (7): a (1) 1, b (2) 1, d (6) 2, e (7) 2
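
One way to realize these operations for nested blocks is a stack of maps, one map per open scope. This is a sketch under assumptions, not the Triangle compiler's own table: the class name NestedIdTable and its method names are hypothetical, and attributes are again plain strings.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an identification table for nested block structure.
class NestedIdTable {
    // The innermost scope is at the head of the deque.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    NestedIdTable() { openScope(); }          // the outermost block is level 1

    void openScope()  { scopes.push(new HashMap<>()); } // at start of a block
    void closeScope() { scopes.pop(); }                 // at end of a block: discard its entries
    int level()       { return scopes.size(); }

    // insert: false if the identifier is already declared in the same block.
    boolean enter(String id, String attribute) {
        Map<String, String> current = scopes.peek();
        if (current.containsKey(id)) return false;
        current.put(id, attribute);
        return true;
    }

    // lookup: searches from the innermost block outwards, so the entry at
    // the highest level wins; null if the identifier is not declared.
    String retrieve(String id) {
        for (Map<String, String> scope : scopes) { // iterates head (innermost) first
            String attribute = scope.get(id);
            if (attribute != null) return attribute;
        }
        return null;
    }
}
```

Replaying the nested let example against this sketch, b resolves to (3) inside the inner block and to (2) again once that block has been closed. Lookup cost grows with nesting depth here; a single hash table with per-identifier entry stacks is a common alternative when that matters.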

Attributes Examples
Kind: constant, variable, procedure, function
Type: boolean, character, integer, record, array

Attributes Information to be extracted from declaration
Constant, variable, procedure, function, type Procedure or function declaration includes a list of formal parameters that may be a constant, variable, procedural, or functional parameter Language provides whole families of record and array types How to manage attribute information Extract type information from declarations and store in information table Could be complex for a realistic programming language Could require tedious programming Use the AST Pointers in information table pointing to location in AST with that identifier

Attributes [Figure: the AST of the nested let example (Program, LetCommand, SequentialDeclaration, VarDeclaration, and SequentialCommand nodes) next to the identification table. Each attribute is a pointer to a declaration node in the AST: at level 1, a → VarDeclaration (1) (Ident. a, type int) and b → VarDeclaration (2) (Ident. b, type bool); inside the inner LetCommand, at level 2, d → VarDeclaration (6) (Ident. d, type bool) and e → VarDeclaration (7) (Ident. e, type int).]

Standard Environment Predefined constants, variables, types, procedures, and functions These are loaded into the identification table Scope rules for standard environment Scope enclosing the entire program Level 0 Same scope level as global declarations Example is C

Type Checking Second task of contextual analyzer is to ensure that the source program contains no type errors Once an applied occurrence of an identifier has been identified, the contextual analyzer will check that the identifier is used in a way consistent with its declaration

Type Checking In a statically typed language, the compiler can detect any type errors without actually running the program For every expression E in the language, the compiler can infer either that E has some type T or that E is ill-typed If E does have type T, then E will always yield a value of type T If a value of type T’ is expected, then compiler checks that T’ is equivalent to T

Type Checking Infers the type of each expression bottom-up
Starting with literals and identifiers, and working up through larger and larger subexpressions
Literal: The type of a literal is immediately known
Identifier: The type of an applied occurrence of identifier I is obtained from the corresponding declaration of I
Unary operator application: Consider “O E” where O is a unary operator of type T1 → T2 Type checker ensures that E’s type is equivalent to T1 Infers that type of “O E” is T2 Otherwise a type error
Binary operator application: Consider “E1 O E2” where O is a binary operator of type T1 × T2 → T3 E1’s type is equivalent to T1 E2’s type is equivalent to T2 “E1 O E2” is of type T3 Otherwise a type error
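
The binary-operator rule can be sketched as a table lookup. Everything named here is an assumption for illustration: the class TypeChecker, the three-operator table, and the extra ERROR type used to represent an ill-typed expression so that checking can continue bottom-up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the rule for "E1 O E2" where O : T1 x T2 -> T3.
class TypeChecker {
    enum Type { INT, BOOL, ERROR }

    // An operator signature T1 x T2 -> result.
    static final class Signature {
        final Type t1, t2, result;
        Signature(Type t1, Type t2, Type result) {
            this.t1 = t1; this.t2 = t2; this.result = result;
        }
    }

    // A tiny, assumed operator table.
    private static final Map<String, Signature> BINARY = new HashMap<>();
    static {
        BINARY.put("+",   new Signature(Type.INT,  Type.INT,  Type.INT));
        BINARY.put("<",   new Signature(Type.INT,  Type.INT,  Type.BOOL));
        BINARY.put("/\\", new Signature(Type.BOOL, Type.BOOL, Type.BOOL));
    }

    // Takes the already-inferred operand types and infers the type of E1 O E2.
    static Type checkBinary(Type e1, String op, Type e2) {
        Signature sig = BINARY.get(op);
        if (sig == null || e1 != sig.t1 || e2 != sig.t2) {
            return Type.ERROR; // type error: unknown operator or operand mismatch
        }
        return sig.result;
    }
}
```

Working bottom-up, the type of (1 + 2) < 3 is found by first checking the inner +, then feeding its result type into the check for <. Type equivalence is plain enum identity in this sketch; record and array types would need a structural or name-based comparison instead.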

Type Checking Type of a nontrivial expression is inferred from the types of its sub-expressions, using the appropriate type rules Must be able to test if two given types T and T’ are equivalent

Type Checking – Constant or Variable Identifier
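[Figure: before and after type checking an applied occurrence of x. Identification links the SimpleVname node (Ident. x) to its ConstDeclaration (Ident. x, Expr. of type T); the type checker then decorates the SimpleVname with :T.]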

Type Checking – Variable Declaration
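[Figure: before and after type checking an applied occurrence of x. Identification links the SimpleVname node (Ident. x) to its VarDeclaration (Ident. x, type denoter T); the type checker then decorates the SimpleVname with :T.]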

Type Checking – Binary Operator
[Figure: a BinaryExpression with operator < and two operand expressions, each decorated :int. Since < is of type int × int → bool, the BinaryExpression is decorated :bool.]

Type Checking Each applied occurrence of an identifier must be identified before type checking can proceed

Chapter 6: Run-time Organization
Marshal the resources of the target machine (instructions, storage, and system software) in order to implement the source language

Chapter 6: Run-time Organization
Data Representation How should we represent the values of each source-language type in the target machine?
Expression Evaluation How should we organize the evaluation of expressions, taking care of intermediate results?
Static Storage Allocation, Stack Storage Allocation, Heap Storage Allocation How should we organize storage for variables, taking into account the different lifetimes of global, local, and heap variables?
Routines How should we implement procedures, functions, and parameters, in terms of low-level routines?
Run-time Organization for Object-oriented Languages How should we represent objects and methods?
Case Study: The Abstract Machine TAM

Data Representation High-Level Data Types Machine Data Types
How should we represent the values of each source-language type in the target machine? High-Level Data Types Truth values Integers Characters Records Arrays Operations over these types Machine Data Types Bits Bytes Words Double-words Low-level arithmetic and logical operations Need to bridge the semantic gap between high-level types and machine level types

Data Representation -- Fundamental Principles
Non-confusion Different values of a given type should have different representations If two different values are confused, i.e., have the same representation, then comparison of these values will incorrectly treat the values as equal Example: approximate representation of real numbers Real numbers that are slightly different mathematically might have the same approximate representation Difficult to avoid – need to take care during compiler design Must avoid confusion in the representations of discrete types such as truth values, characters, and integers For statically typed languages need only be concerned with values of the same type 00…00₂ may represent false, the integer 0, or the real number 0.0 Compile-time type checks keep values of different types apart

Data Representation -- Fundamental Principles
Uniqueness Each value should always have the same representation Example of non-uniqueness Ones-complement representation of integers in which zero is represented both by 00…00₂ and 11…11₂ (+0 and –0) A simple bit-string comparison would incorrectly treat these values as unequal More specialized integer comparison must be used Alternative twos-complement representation gives us unique representations of integers

Data Representation – Pragmatic Issues
Constant-size representation The representations of all values of a given type should occupy the same amount of space Make possible for compiler to plan the allocation of storage Knowing the type of variable but not the actual value, the compiler will know exactly how much storage space the variable will occupy

Data Representation – Pragmatic Issues
Direct representation vs. indirect representation Should the values of a given type be represented directly, or indirectly through pointers? Direct representation Just the binary representation of the value consisting of one or more bits, bytes, words Indirect representation A handle that points to the storage area which has the binary representation of the value Essential for types whose values vary greatly in size List or dynamic array

Direct representation vs. indirect representation
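[Figure: direct versus indirect representation of x and of y, where y has the same type as x but requires more space; with indirect representation each variable holds a constant-size handle that points to the value]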

Notation #T: cardinality of type T
Number of distinct values of type T, e.g. #[[Boolean]] = 2 size T: amount of space (in bits, bytes, or words) occupied by each value of type T For indirect representation only the handle is counted For direct representation of type T: size T ≥ log2(#T), or equivalently 2^(size T) ≥ #T, with size T expressed in bits In n bits we can represent at most 2^n distinct values if we are to avoid confusion (the non-confusion requirement)
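
The bound size T ≥ log2(#T) can be turned into a small calculation: the smallest n with 2^n ≥ #T. The helper below is a hypothetical illustration (class and method names are not from the course) and is limited to cardinalities up to 2^62 so that the shift stays within a Java long.

```java
// Hypothetical sketch: the minimum number of bits that avoids confusion
// for a type with the given cardinality.
class TypeSize {
    static int minBits(long cardinality) {
        if (cardinality < 1 || cardinality > (1L << 62)) {
            throw new IllegalArgumentException("cardinality out of supported range");
        }
        int bits = 0;
        while ((1L << bits) < cardinality) { // stop at the first n with 2^n >= #T
            bits++;
        }
        return bits;
    }
}
```

For instance, minBits(2) gives the single bit needed for Boolean, and minBits(1L << 16) gives the 16 bits needed for a character set with 2^16 members.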

Primitive Types Cannot be decomposed into simpler values
Most programming languages provide these primitive types Boolean, Char, Integer Also provide elementary logical and arithmetic operations Machines typically support the above primitive types, so choice of representation is straightforward

Primitive Types Representation
Boolean true and false Since #[[Boolean]] = 2, size[[Boolean]] ≥ 1 bit Can represent Boolean with one bit, one byte, or one word For single bit: 0 for false and 1 for true For byte or word: 00…00₂ for false and either 00…01₂ or 11…11₂ for true Negation, conjunction, disjunction → machine operations NOT, AND, OR

Char Source language can specify character set Ada: ISO-Latin1 character set (2^8 distinct characters) Java: Unicode character set (2^16 distinct characters) Most do not Allows compiler writers to choose the machine’s native character set (2^7 or 2^8 distinct characters) ISO defines character representation for “A” to be 01000001₂ Can represent a character by one byte or one word

Integer Denotes an implementation-defined bounded range of integers Defined by the individual language processor Binary representation determined by target machine’s arithmetic unit and almost always occupies one word Can implement language’s integer operations with machine's integer operations Pascal and Triangle: -maxint, …, -1, 0, +1, …, +maxint maxint is implementation defined #[[Integer]] = 2 × maxint + 1 2^(size[[Integer]]) ≥ 2 × maxint + 1 For word size of w bits, size[[Integer]] = w, maxint = 2^(w-1) – 1 Java: int denotes –2^31, …, -1, 0, +1, …, +2^31 – 1 #[[int]] = 2^32

Record Type Consists of several fields, each of which has an identifier All records of a particular type have fields with the same identifiers and types Fundamental operation on records is field selection Use one field identifier to access the corresponding field Simple representation Juxtapose the fields to make them occupy consecutive positions in storage Allows us to predict total size of each record and the position of each field relative to the base of the record

Record Type Consider the following
type T = record I1: T1, …, In: Tn end; var r: T
size T = size T1 + … + size Tn If size T1, …, and size Tn are all constant, then size T is also constant Implementation of field selection address[[r.Ii]] = address r + (size T1 + … + size T(i-1)) Some machines have alignment restrictions, which force unused space to be left between record fields; these equations then no longer hold as written
[Figure: r laid out as consecutive fields r.I1 (value of type T1), r.I2 (value of type T2), …, r.In (value of type Tn)]
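
The field-selection equation amounts to a running sum of field sizes. The following sketch assumes no alignment padding (the case the equations cover) and measures sizes in words; the class name RecordLayout is hypothetical.

```java
// Hypothetical sketch: offsets and total size of a record whose fields
// are juxtaposed in declaration order, with no alignment padding.
class RecordLayout {
    // offsets[i] = size T1 + ... + size T(i-1), relative to the record's base.
    static int[] fieldOffsets(int[] fieldSizes) {
        int[] offsets = new int[fieldSizes.length];
        int offset = 0;
        for (int i = 0; i < fieldSizes.length; i++) {
            offsets[i] = offset;
            offset += fieldSizes[i];
        }
        return offsets;
    }

    // size T = size T1 + ... + size Tn
    static int recordSize(int[] fieldSizes) {
        int total = 0;
        for (int size : fieldSizes) total += size;
        return total;
    }
}
```

For a record of three one-word fields the offsets are 0, 1, 2 and the record occupies 3 words; address[[r.Ii]] is then address r plus the i-th offset. On a machine with alignment restrictions each offset would first be rounded up to the field's alignment.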

Disjoint Unions Tag and a variant part
Value of tag determines type of variant part T = T1 + … + Tn In each value of type T, the variant part is a value chosen from one of the types T1, …, or Tn; the tag indicates which one size T = size Ttag + max(size T1, …, size Tn) address[[u.Itag]] = address u + 0 address[[u.Ii]] = address u + size Ttag Variants smaller than the largest one leave wasted space
[Figure: three layouts of u, each starting with u.Itag (value of type Ttag) followed by u.I1 (value of type T1), or u.I2 (value of type T2), or … u.In (value of type Tn); the variant part always occupies max(size T1, …, size Tn), with the remainder marked as wasted space]
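
The size equation for a disjoint union is a one-liner; the sketch below (hypothetical class name UnionLayout, sizes in words, no alignment padding assumed) also reports the space wasted by a given variant.

```java
// Hypothetical sketch: layout of a disjoint union with a tag followed
// by a variant part sized for the largest variant.
class UnionLayout {
    // size T = size Ttag + max(size T1, ..., size Tn)
    static int unionSize(int tagSize, int[] variantSizes) {
        int max = 0;
        for (int size : variantSizes) max = Math.max(max, size);
        return tagSize + max;
    }

    // address[[u.Ii]] - address u = size Ttag, the same for every variant.
    static int variantOffset(int tagSize) {
        return tagSize;
    }

    // Unused words when the variant with the given size is the current one.
    static int wastedSpace(int variantSize, int[] variantSizes) {
        return unionSize(0, variantSizes) - variantSize;
    }
}
```

With a one-word tag and variants of 1, 3, and 2 words the union occupies 4 words, and a value holding the first variant wastes 2 of them.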

Static Arrays Consists of several elements, all of the same type
Bounded range of indices – usually integers Each index has exactly one element Fundamental operation on arrays is indexing Access an individual element by giving its index Index evaluated at run-time Static Array Index bounds are known at compile-time Direct representation is to juxtapose the array elements, in order of increasing indices. Implemented by run-time address computation

Static Arrays (lower index bound is 0)
Consider the following example type T = array n of Telem; var a: T size T = n × size Telem The number of elements n is constant and size Telem is constant, so size T is also constant address[[a[i]]] = address a + (i × size Telem) Since i is known only at run-time, an array indexing implies a run-time address computation
[Figure: a[0], a[1], a[2], …, a[n-1] stored consecutively, each a value of type Telem]

Static Arrays (programmer chooses lower and upper array bounds)
Consider the following example type T = array [l..u] of Telem; var a: T size T = (u – l + 1) × size Telem The number of elements (u – l + 1) is constant and size Telem is constant, so size T is also constant address[[a[i]]] = address a + ((i – l) × size Telem) = address a – (l × size Telem) + (i × size Telem) address[[a[0]]] = address a – (l × size Telem) address[[a[i]]] = address[[a[0]]] + (i × size Telem) Since i is known only at run-time, an array indexing implies a run-time address computation Index check must ensure that l ≤ i ≤ u
[Figure: a[l], a[l+1], a[l+2], …, a[u] stored consecutively, each a value of type Telem]
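
The address arithmetic for a static array with programmer-chosen bounds can be sketched as below. The class name StaticArray is hypothetical, addresses are plain integers, and the index check is modelled with a Java exception rather than whatever a real compiler would emit.

```java
// Hypothetical sketch: address computation for "array [l..u] of Telem".
class StaticArray {
    final int lower, upper, elemSize;
    final int origin; // address[[a[0]]] = address a - (l x size Telem), fixed at compile time

    StaticArray(int baseAddress, int lower, int upper, int elemSize) {
        this.lower = lower;
        this.upper = upper;
        this.elemSize = elemSize;
        this.origin = baseAddress - lower * elemSize;
    }

    // size T = (u - l + 1) x size Telem
    int size() {
        return (upper - lower + 1) * elemSize;
    }

    // address[[a[i]]] = address[[a[0]]] + (i x size Telem), after the index check.
    int addressOf(int i) {
        if (i < lower || i > upper) {
            throw new IndexOutOfBoundsException("index " + i + " not in " + lower + ".." + upper);
        }
        return origin + i * elemSize;
    }
}
```

Precomputing the origin means each indexing operation costs one multiplication and one addition, with no subtraction of l at run-time. Note that the origin may lie outside the array's own storage when l is not 0; it is only ever used as a base for the computation.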

Dynamic Arrays An array whose index bounds are not known until run-time
Different dynamic arrays of the same type may have different index bounds, and therefore different numbers of elements Need to satisfy constant-size requirement Create array descriptor or handle Pointer to the array’s elements Index bounds Handle has constant size

Dynamic Arrays Ada example
type T is array (Integer range <>) of Telem; a: T (E1 .. E2); size T = address:size + 2 × size[[Integer]] address:size is the amount of space required to store an address – usually one word. Satisfies constant-size requirement Declaration of array variable a: E1 and E2 are evaluated to yield a’s index bounds (say l and u) Space is allocated for (u – l + 1) elements, juxtaposed and separate from a’s handle address[[a(0)]] = address[[a(l)]] – (l × size Telem) Values for address[[a(0)]], l, and u are stored in a’s handle The element with index i will be addressed as follows: address[[a(i)]] = address[[a(0)]] + (i × size Telem) = content(address[[a]]) + (i × size Telem) Index check is l ≤ i ≤ u where l = content(address[[a]] + address:size) and u = content(address[[a]] + address:size + size[[Integer]])
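
The handle-based addressing can be simulated with an int array standing in for word-addressed storage. The sketch assumes address:size = size[[Integer]] = 1 word; the class name DynamicArrayDemo, the method names, and the store layout are all hypothetical.

```java
// Hypothetical sketch: a dynamic array handle holding the origin
// address[[a(0)]], the lower bound l, and the upper bound u.
class DynamicArrayDemo {
    static final int ADDRESS_SIZE = 1; // assumed: one word per address
    static final int INTEGER_SIZE = 1; // assumed: one word per Integer

    // Fills in the handle at handleAddr for elements allocated at elemsAddr.
    static void makeHandle(int[] store, int handleAddr, int elemsAddr,
                           int l, int u, int elemSize) {
        store[handleAddr] = elemsAddr - l * elemSize;          // address[[a(0)]]
        store[handleAddr + ADDRESS_SIZE] = l;                  // lower bound
        store[handleAddr + ADDRESS_SIZE + INTEGER_SIZE] = u;   // upper bound
    }

    // address[[a(i)]] = content(address[[a]]) + (i x size Telem), after the index check.
    static int addressOf(int[] store, int handleAddr, int i, int elemSize) {
        int l = store[handleAddr + ADDRESS_SIZE];
        int u = store[handleAddr + ADDRESS_SIZE + INTEGER_SIZE];
        if (i < l || i > u) {
            throw new IndexOutOfBoundsException("index " + i + " not in " + l + ".." + u);
        }
        return store[handleAddr] + i * elemSize;
    }
}
```

Whatever the bounds turn out to be at run-time, the handle always occupies ADDRESS_SIZE + 2 × INTEGER_SIZE words, which is what lets the compiler allocate space for the variable without knowing the array's length.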

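Dynamic Arrays [Figure: the handle of a holds the origin (the address of a[0]), the lower bound l, and the upper bound u; the elements a[l], a[l+1], a[l+2], …, a[u], each of type Telem, are stored separately from the handle]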

Status Chapter 6: Run-time Organization Data Representations
Primitive types Record types Disjoint unions Static arrays Dynamic arrays Recursive types Expression Evaluation Register machine Stack machine Static Storage Allocation Global variables Stack Storage Allocation Local variables

Recursive Types Defined in terms of itself
Values of recursive type T have components that are themselves of type T Examples List with tail being itself a list Tree with the sub-trees themselves being trees

Recursive Types Consider the Pascal declaration
type IntList = ^IntNode; IntNode = record head: Integer; tail: IntList end; var primes: IntList size[[IntList]] = address:size (usually 1 word) Always use pointers to represent values of the recursive type
[Figure: primes holds a handle that points to the first IntNode of the list]

Expression Evaluation Register Machine
How should we organize the evaluation of expressions The problem is the need to keep intermediate results somewhere Consider the expression a * b + (1 – (c * 2)) Will have intermediate results for a * b, c * 2, and 1 – (c * 2) For a register based machine (non-stack machine) Use the registers to store intermediate results Problem arises when there are not enough registers for all intermediate results

Expression Evaluation Example a * b + (1 – (c * 2))
  LOAD R1 a
  MULT R1 b
  LOAD R2 #1
  LOAD R3 c
  MULT R3 #2
  SUB  R2 R3
  ADD  R1 R2
a, b, c are memory addresses for the values of a, b, c

Expression Evaluation Stack Machine
The machine provides a stack for holding intermediate results
For the expression a * b + (1 – (c * 2)):
  LOAD a
  LOAD b
  MULT
  LOADL 1
  LOAD c
  LOADL 2
  MULT
  SUB
  ADD

Expression Evaluation Stack Machine Example a * b + (1 – (c * 2))
Stack contents after each instruction (bottom of stack first; the rest is unused space):
  (1) After LOAD a:   value of a
  (2) After LOAD b:   value of a, value of b
  (3) After MULT:     value of a*b
  (4) After LOADL 1:  value of a*b, 1
  (5) After LOAD c:   value of a*b, 1, value of c
  (6) After LOADL 2:  value of a*b, 1, value of c, 2
  (7) After MULT:     value of a*b, 1, value of c*2
  (8) After SUB:      value of a*b, value of 1-(c*2)
  (9) After ADD:      value of (a*b)+(1-(c*2))
Operands of different types (and therefore different sizes) can be evaluated in just the same way, e.g., AND, OR, function calls, etc.
Each operation takes values from top of stack and places results onto top of stack
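The stack discipline traced above can be tried out with a small Java sketch. It is an illustration under stated assumptions: instructions are plain strings, only LOAD, LOADL, ADD, SUB, and MULT are handled, and the class and method names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Sketch of stack-machine expression evaluation. The string instruction format and
// the names run and demo are illustrative assumptions, not part of any real machine.
final class StackEvalDemo {
    // Executes the instructions in order and returns the single value left on the stack.
    static int run(String[] code, Map<String, Integer> vars) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String instr : code) {
            String[] parts = instr.split(" ");
            switch (parts[0]) {
                case "LOAD":  stack.push(vars.get(parts[1])); break;        // push a variable's value
                case "LOADL": stack.push(Integer.parseInt(parts[1])); break; // push a literal
                default: {
                    // Binary operation: pop both operands, push the result.
                    int right = stack.pop();
                    int left = stack.pop();
                    switch (parts[0]) {
                        case "ADD":  stack.push(left + right); break;
                        case "SUB":  stack.push(left - right); break;
                        case "MULT": stack.push(left * right); break;
                        default: throw new IllegalArgumentException("unknown instruction: " + instr);
                    }
                }
            }
        }
        return stack.pop();
    }

    // Evaluates a * b + (1 - (c * 2)) with the instruction sequence from the slide.
    static int demo(int a, int b, int c) {
        String[] code = {
            "LOAD a", "LOAD b", "MULT", "LOADL 1", "LOAD c", "LOADL 2", "MULT", "SUB", "ADD"
        };
        return run(code, Map.of("a", a, "b", b, "c", c));
    }

    public static void main(String[] args) {
        System.out.println(demo(3, 4, 5)); // 3*4 + (1 - 5*2) = 3
    }
}
```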

Static Storage Allocation Global Variables
Each variable in source program requires enough storage to contain any value that might be assigned to it As a consequence of constant-size representation, the compiler knows how much storage needs to be allocated to variable, based on type of variable (size T) Global variables Variables that exist and take up storage throughout the program’s run-time. Static storage allocation: Compiler locates these variables at some fixed positions in storage (decides each global variable’s address relative to the base of the storage region in which global variables are located)

Static Storage Allocation Global Variables: Example
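let
  type Date = record y: Integer, m: Integer, d: Integer end;
  var a: array 3 of Integer;
  var b: Boolean;
  var c: Char;
  var t: Date
in . . .
[Figure: storage layout from the base of the global region: a(0), a(1), a(2) (together a), then b, then c, then t.y, t.m, t.d (together t), then unused space]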
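The address assignment in this example amounts to a running sum of sizes. The following Java sketch shows that computation; it assumes that Integer, Boolean, and Char each occupy one word (so a needs 3 words and t needs 3 words), and the class and method names are invented for the illustration.

```java
// Sketch of static storage allocation: each global variable gets the next free
// address relative to the base of the global region. The one-word sizes used in
// main are an assumption made for this example.
final class GlobalAllocDemo {
    // Returns the address of each variable, given the sizes in declaration order.
    static int[] allocate(int[] sizes) {
        int[] addresses = new int[sizes.length];
        int next = 0;                 // offset from the base of the global region
        for (int k = 0; k < sizes.length; k++) {
            addresses[k] = next;      // variable k starts at the next free word
            next += sizes[k];         // reserve room for any value of its type
        }
        return addresses;
    }

    public static void main(String[] args) {
        String[] names = {"a", "b", "c", "t"};
        int[] sizes = {3, 1, 1, 3};   // a: array 3 of Integer; b: Boolean; c: Char; t: Date
        int[] addresses = allocate(sizes);
        for (int k = 0; k < names.length; k++) {
            System.out.println(names[k] + " at address " + addresses[k]);
        }
    }
}
```

Under these assumptions a starts at 0, b at 3, c at 4, and t at 5, which matches the order shown in the layout.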

Stack Storage Allocation Local Variables
A local variable v is one that is declared inside a procedure (or function). Lifetime of v: the variable v exists (occupies storage) only during an activation of that procedure If same procedure is activated several times v will have several lifetimes Each activation creates a distinct variable

Stack Storage Allocation Local Variables: An Example
let
  var a: array 3 of Integer;
  var b: Boolean;
  var c: Char;
  proc Y () ~
    var d: Integer;
    var e: record c: Char, n: Integer end
    in . . .
  proc Z () ~
    var f: Integer
    begin …; Y(); … end
begin …; Y(); …; Z(); … end

Stack Storage Allocation Local Variables: An Example
[Figure: timeline of the program run (program starts, calls Y, return from Y, calls Z, Z calls Y, return from Y, return from Z, program stops), showing the lifetimes of the variables local to Y, the variables local to Z, and the global variables]
Observations:
  Global variables are the only ones that exist throughout the program’s run-time
    Use static allocation for global variables
  Lifetimes of local variables are properly nested
    Use a stack for local variables

Stack Storage Allocation Stack Frames: An Example
Stack contents, from SB upwards:
  (1) After program starts:   globals
  (2) After program calls Y:  globals, frame for Y (LB at its base, ST at its top)
  (3) After return from Y:    globals
  (4) After program calls Z:  globals, frame for Z (LB at its base, ST at its top)
  (5) After Z calls Y:        globals, frame for Z, frame for Y (LB at the base of Y’s frame, ST at its top)
  (6) After return from Y:    globals, frame for Z
  (7) After return from Z:    globals
Each frame holds a dynamic link to the frame beneath it
Registers
  SB: Stack Base – Location of global variables
  LB: Local Base – Local variables of currently running procedure
  ST: Stack Top – Very top of stack

Stack Storage Allocation
The stack varies in size
  For example, the frames for Y’s two activations are at two different locations
  The position of a frame within a stack cannot be predicted in advance
Need registers dedicated to point to the frames
Registers (find address of variables relative to these registers)
  SB: stack base – is fixed, pointing to the base of the stack. This is where the global variables are located.
  LB: local base – points to the base of the topmost frame in the stack. This frame always contains the variables of the currently running procedure.
  ST: stack top – points to the very top of the stack. ST keeps track of the frame boundary as expressions are evaluated and the top of the stack expands and contracts.

Stack Storage Allocation
Frame contents
  Space for local variables
  Link data
    Return address – code address to which control will be returned at the end of the procedure activation. It is the address of the instruction following the call instruction that activated the procedure in the first place.
    Dynamic link – the pointer to the base of the underlying frame in the stack. It is the old content of LB and will be restored at the end of the procedure activation.
Since there are two words of link data, local variable addresses are offset by 2
[Figure: frame layout: link data (dynamic link, return address) followed by local data]
This only considers access to local or global variables, not nested variables.
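The frame mechanism described above can be sketched in Java as follows. This is an illustration only: the simulated store, the choice of LB = SB while no procedure is active, and the names `call`, `ret`, and `localAddress` are assumptions, and static links for nested procedures are left out, as on the slide.

```java
// Sketch of stack frames with two words of link data (dynamic link, return address).
// The simulated store and all method names are illustrative assumptions.
final class FrameDemo {
    final int[] store = new int[32];
    int SB = 0;   // stack base: start of the globals
    int LB = 0;   // local base: base of the topmost frame (assumed equal to SB when no call is active)
    int ST;       // stack top: first free word

    FrameDemo(int globalsSize) {
        ST = SB + globalsSize;
    }

    // Pushes a frame: dynamic link, return address, then space for the locals.
    void call(int returnAddress, int localsSize) {
        store[ST] = LB;                  // dynamic link: the old content of LB
        store[ST + 1] = returnAddress;   // where to continue after the call
        LB = ST;                         // the new frame becomes the topmost frame
        ST = LB + 2 + localsSize;
    }

    // Pops the topmost frame and returns the saved return address.
    int ret() {
        int returnAddress = store[LB + 1];
        ST = LB;                         // discard the frame
        LB = store[LB];                  // restore LB from the dynamic link
        return returnAddress;
    }

    // Address of a local variable: its offset is shifted past the two link words.
    int localAddress(int offset) {
        return LB + 2 + offset;
    }

    public static void main(String[] args) {
        FrameDemo m = new FrameDemo(5);  // globals a (3 words), b, c
        m.call(100, 1);                  // program calls Z (local f)
        m.call(200, 3);                  // Z calls Y (locals d and e)
        System.out.println("LB=" + m.LB + " ST=" + m.ST);
        m.ret();                         // return from Y
        m.ret();                         // return from Z
        System.out.println("LB=" + m.LB + " ST=" + m.ST);
    }
}
```

The code addresses 100 and 200 are arbitrary placeholders; the point is that each return restores exactly the LB and ST values that held before the matching call.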

Chapter 7: Code Generation
Code Selection
A Code Generation Algorithm
Constants and Variables
Procedures and Functions
Case Study: Code Generation in the Triangle Compiler
Requirements are defined first: Customer needs are analyzed and requirements are developed for the entire system prior to any of the remaining DLM activities being initiated.
Multiple internal development cycles: Multiple subprojects implementing a subset of the system are planned and conducted as part of the overall project.
Multiple customer deliveries: Multiple deliveries (i.e., builds) of the system are provided to the customer at interim points in the project. Only the final build typically contains the total functionality of the system.
Functional content primary driver: Refers to the primary kind of information that determines the functionality of the system.
Process primary driver: Refers to the primary process-related issues that have a heavy influence on selecting the most appropriate DLM.

Code Generation Translation of the source program to object code
Dependent on source language and target machine Target Machines Registers, or stack, or both for intermediate results Instructions with zero, one, two, or three operands, or a mixture Single addressing mode, or many

Code Generation Major Subproblems
Code selection: which sequence of target machine instructions will be the object code for each phrase
  Write code templates: a code template is a general rule specifying the object code of all phrases of a particular form (e.g., all assignment commands)
  But there are usually lots of special cases
Storage allocation: deciding the storage address of each variable in the source program
  Exact for global variables, but only relative for local variables
Register allocation: deciding which registers should be used to hold intermediate results during expression evaluation
  Complex expressions: there may not be enough registers
Since code generation for a stack machine is much simpler than for a register machine, we will only generate code for a stack machine

Code Generation Code Selection
Deciding which sequence of instructions to generate for each case
Code template: specifies the object code to which a phrase is translated, in terms of the object code to which its subphrases are translated
Object code: sequence of instructions to which the source-language phrase will be translated
Code specification: collection of code functions and code templates; must cover the entire source language

Abstract Machine TAM
Suitable for executing programs compiled from a block-structured language such as Triangle
All evaluation takes place on a stack
Primitive arithmetic, logical, and other operations are treated uniformly with programmed functions and procedures
Two separate stores
  Code Store: 32-bit instruction words (read only)
  Data Store: 16-bit data words (read-write)

Abstract Machine TAM Code and Data Stores
Code Store
  Fixed while program is running
  Code segment: contains the program’s instructions
    CB points to base of code segment
    CT points to top of code segment
    CP points to next instruction to be executed
      Initialized to CB (the program’s first instruction is at the base of the code segment)
  Primitive segment: contains ‘microcode’ for elementary arithmetic, logical, input-output, heap, and general-purpose operations
    PB points to base of primitive segment
    PT points to top of primitive segment

While the program is running, the segments of the Data Store may vary
  Stack grows from the low-address end of the Data Store
    SB points to base of the stack
    ST points to top of the stack
      Initialized to SB
  Heap grows from the high-address end of the Data Store
    HB points to base of heap
    HT points to top of heap
      Initialized to HB

[Figure: layout of the Code Store (code segment from CB to CT with CP pointing into it, unused space, primitive segment from PB to PT) and of the Data Store (global segment at SB, stack of frames with LB at the base of the topmost frame and ST at the stack top, unused space, heap segment from HT to HB)]
Stack and heap can expand and contract
Global segment is always at base of stack
Stack can contain any number of other segments, known as frames, containing data local to an activation of some routine
LB points to base of topmost frame

Code Functions
  run P: Run the program P and then halt, starting and finishing with an empty stack
  execute C: Execute the command C, possibly updating variables, but neither expanding nor contracting the stack
  evaluate E: Evaluate the expression E, pushing its result on to the stack top, but having no other effect
  fetch V: Push the value of the constant or variable named V on to the stack top
  assign V: Pop a value from the stack top, and store it in the variable named V
  elaborate D: Elaborate the declaration D, expanding the stack to make space for any constants and variables declared therein

Abstract Machine TAM Instructions
  LOAD(n) d[r]: Fetch an n-word object from the data address (d+register r), and push it on to the stack
  LOADA d[r]: Push the data address (d+register r) on to the stack
  LOADI(n): Pop a data address from the stack, fetch an n-word object from that address, and push it on to the stack
  LOADL d: Push the 1-word literal value d on to the stack
  STORE(n) d[r]: Pop an n-word object from the stack, and store it at the data address (d+register r)
  STOREI(n): Pop an address from the stack, then pop an n-word object from the stack and store it at that address
  CALL(n) d[r]: Call the routine at code address (d+register r), using the address in register n as the static link
  CALLI: Pop a closure (static link and code address) from the stack, then call the routine at that code address
  RETURN(n) d: Return from the current routine: pop an n-word result from the stack, then pop the topmost frame, then pop d words of arguments, then push the result back on to the stack
  PUSH d: Push d words (uninitialized) on to the stack
  POP(n) d: Pop an n-word result from the stack, then pop d more words, then push the result back on to the stack
  JUMP d[r]: Jump to code address (d+register r)
  JUMPI: Pop a code address from the stack, then jump to that address
  JUMPIF(n) d[r]: Pop a 1-word value from the stack, then jump to code address (d+register r) if and only if that value equals n
  HALT: Stop execution of the program

While Command
execute [[while E do C]] =
      JUMP h
  g:  execute C
  h:  evaluate E
      JUMPIF(1) g

While Command execute [[while i > 0 do i := i – 2]]
      30: JUMP 35        // JUMP h
  g:  31: LOAD i         // execute [[i := i – 2]]
      32: LOADL 2
      33: CALL sub
      34: STORE i
  h:  35: LOAD i         // evaluate [[i > 0]]
      36: LOADL 0
      37: CALL gt
      38: JUMPIF(1) 31   // JUMPIF(1) g

While Command
public Object visitWhileCommand(WhileCommand ast, Object o) {
  Frame frame = (Frame) o;
  int jumpAddr, loopAddr;

  jumpAddr = nextInstrAddr;                 // address of the JUMP instruction, to be patched once h is known
  emit(Machine.JUMPop, 0, Machine.CBr, 0);  // emits JUMP h with a placeholder operand
  loopAddr = nextInstrAddr;                 // this is address g:
  ast.C.visit(this, frame);                 // generates code for C
  patch(jumpAddr, nextInstrAddr);           // this is address h:, filled into the JUMP h instruction
  ast.E.visit(this, frame);                 // generates code for E
  emit(Machine.JUMPIFop, Machine.trueRep, Machine.CBr, loopAddr);  // JUMPIF(1) g: jumps back to g if E is true
  return null;
}
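The backpatching step can be seen in isolation in the following self-contained Java sketch. It does not use the Triangle compiler's classes: instructions are plain strings, and the class `WhileEmitDemo` with its `emit`, `patch`, and `emitWhile` methods is a hypothetical stand-in for the encoder.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of emitting the while-command template with backpatching.
// Everything here (string instructions, method names) is an illustrative assumption.
final class WhileEmitDemo {
    final List<String> code = new ArrayList<>();
    final int base;   // code address of the first emitted instruction

    WhileEmitDemo(int base) {
        this.base = base;
    }

    int nextAddr() { return base + code.size(); }
    void emit(String instr) { code.add(instr); }
    void patch(int addr, String instr) { code.set(addr - base, instr); }

    // Emits: JUMP h;  g: body;  h: condition;  JUMPIF(1) g
    void emitWhile(String[] condition, String[] body) {
        int jumpAddr = nextAddr();
        emit("JUMP ?");                          // operand unknown until h is reached
        int loopAddr = nextAddr();               // g
        for (String instr : body) emit(instr);
        patch(jumpAddr, "JUMP " + nextAddr());   // h is the next address; backpatch the jump
        for (String instr : condition) emit(instr);
        emit("JUMPIF(1) " + loopAddr);
    }

    // Generates code for: while i > 0 do i := i - 2, starting at code address 30.
    static List<String> example() {
        WhileEmitDemo g = new WhileEmitDemo(30);
        g.emitWhile(new String[] {"LOAD i", "LOADL 0", "CALL gt"},
                    new String[] {"LOAD i", "LOADL 2", "CALL sub", "STORE i"});
        return g.code;
    }

    public static void main(String[] args) {
        List<String> code = example();
        for (int k = 0; k < code.size(); k++) {
            System.out.println((30 + k) + ": " + code.get(k));
        }
    }
}
```

Starting at address 30, the sketch produces the same nine instructions as the worked example, with JUMP 35 first and JUMPIF(1) 31 last.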

Repeat Command
execute [[repeat C until E]] =
  g:  execute C
      evaluate E
      JUMPIF(0) g

Repeat Command execute [[repeat i := i – 2 until i < 0]]
  g:  31: LOAD i         // execute [[i := i – 2]]
      32: LOADL 2
      33: CALL sub
      34: STORE i
      35: LOAD i         // evaluate [[i < 0]]
      36: LOADL 0
      37: CALL lt
      38: JUMPIF(0) 31   // JUMPIF(0) g

Repeat Command
public Object visitRepeatCommand(RepeatCommand ast, Object o) {
  Frame frame = (Frame) o;
  int jumpAddr, loopAddr;

  // emit(Machine.JUMPop, 0, Machine.CBr, 0);
  // jumpAddr = nextInstrAddr;
  loopAddr = nextInstrAddr;                 // this is address g:
  ast.C.visit(this, frame);                 // generates code for C
  // patch(jumpAddr, nextInstrAddr);
  ast.E.visit(this, frame);                 // generates code for E
  emit(Machine.JUMPIFop, Machine.falseRep, Machine.CBr, loopAddr);  // JUMPIF(0) g: jumps back to g if E is false
  return null;
}

Abstract Machine TAM Routines

Abstract Machine TAM Primitive Routines

Extend Mini-Triangle V1 , V2 := E1 , E2
This is a simultaneous assignment: both E1 and E2 are to be evaluated, and then their values assigned to the variables V1 and V2, respectively
      evaluate E1     Result pushed to top of stack
      evaluate E2     Result pushed to top of stack
      assign V2       Top of stack stored in variable V2
      assign V1       Top of stack stored in variable V1
[Figure: stack snapshots: result of E1 pushed; result of E2 pushed above it; result of E2 popped into V2; result of E1 popped into V1]

Extend Mini-Triangle C1 , C2
This is a collateral command: the subcommands C1 and C2 are to be executed in any order chosen by the implementer
      execute C1
      execute C2
Top of stack unchanged

Extend Mini-Triangle if E then C
This is a conditional command: if E evaluates to true, C is executed, otherwise nothing
      evaluate E      Result pushed to top of stack
      JUMPIF(0) g     Jump to g if E evaluates to false
      execute C       Top of stack unchanged
  g:                  Jump location

Extend Mini-Triangle repeat C until E
This is a loop command: E is evaluated at the end of each iteration (after executing C), and the loop terminates if its value is true
  g:  execute C       Top of stack unchanged
      evaluate E      Result pushed to top of stack
      JUMPIF(0) g     Jump to g if E evaluates to false

Extend Mini-Triangle repeat C1 while E do C2
This is a loop command: E is evaluated in the middle of each iteration (after executing C1 but before executing C2), and the loop terminates if its value is false
      JUMP h
  g:  execute C2      Top of stack unchanged
  h:  execute C1      Top of stack unchanged
      evaluate E      Result pushed to top of stack
      JUMPIF(1) g     Jump to g if E evaluates to true

Extend Mini-Triangle if E1 then E2 else E3
This is a conditional expression: if E1 evaluates to true, E2 is evaluated, otherwise E3 is evaluated (E2 and E3 must be of the same type)
      evaluate E1     Result pushed to top of stack
      JUMPIF(0) g     Jump to g if E1 evaluates to false
      evaluate E2     Result pushed to top of stack
      JUMP h
  g:  evaluate E3     Result pushed to top of stack
  h:                  Jump location

Extend Mini-Triangle let D in E
This is a block expression: the declaration D is elaborated, and the resultant bindings are used in the evaluation of E
      elaborate D     Expand stack for variables or constants
      evaluate E      Result pushed to top of stack
      POP(n) s        If s > 0: pop an n-word result from the stack, pop s more words, then push the n-word result back on to the stack
where s = amount of storage allocated by D and n = size (type of E)

Extend Mini-Triangle begin C; yield E end
Here the command C is executed (making side effects), and then E is evaluated
      execute C       Top of stack unchanged
      evaluate E      Result pushed to top of stack

Extend Mini-Triangle for I from E1 to E2 do C
First the expressions E1 and E2 are evaluated, yielding the integers m and n, respectively. Then the subcommand C is executed repeatedly, with I bound to the integers m, m+1, …, n in successive iterations. If m > n, C is not executed at all. The scope of I is C, which may fetch I but may not assign to it.

Extend Mini-Triangle for I from E1 to E2 do C
      evaluate E2     Compute final value
      evaluate E1     Compute initial value of I
      JUMP h
  g:  execute C       Top of stack unchanged
      CALL succ       Increment current value of I
  h:  LOAD –1[ST]     Fetch current value of I
      LOAD –3[ST]     Fetch final value
      CALL le         Test current value <= final value
      JUMPIF(1) g     If so, repeat
      POP(0) 2        Discard current and final values
At g and at h, the current value of I is at the stack top (at address –1[ST]), and the final value is immediately underlying (at address –2[ST])
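To check that the for-command template leaves the stack as it found it, the instructions can be traced by hand on a small explicit stack. The Java sketch below does this; it is not a TAM interpreter, the body C is assumed to do nothing except read I, and the class and method names are invented for the illustration.

```java
// Sketch only: traces the for-command template on an explicit stack.
// The body C is assumed to only read I (at -1[ST]); all names are illustrative.
final class ForLoopDemo {
    // Returns the successive values of I seen by the body, separated by spaces.
    static String run(int m, int n) {
        int[] stack = new int[8];
        int st = 0;                            // ST: index of the first free word
        StringBuilder seen = new StringBuilder();

        stack[st++] = n;                       // evaluate E2 (final value)
        stack[st++] = m;                       // evaluate E1 (initial value of I)
        boolean enterBody = false;             // JUMP h: the first pass skips the body
        while (true) {
            if (enterBody) {
                // g: execute C, which here just records I
                if (seen.length() > 0) seen.append(' ');
                seen.append(stack[st - 1]);
                stack[st - 1] += 1;            // CALL succ
            }
            stack[st] = stack[st - 1]; st++;   // h: LOAD -1[ST] (current value of I)
            stack[st] = stack[st - 3]; st++;   // LOAD -3[ST] (final value, now three below ST)
            int right = stack[--st];           // CALL le pops both operands...
            int left = stack[--st];
            stack[st++] = (left <= right) ? 1 : 0; // ...and pushes the truth value
            if (stack[--st] != 1) break;       // JUMPIF(1) g pops the truth value
            enterBody = true;
        }
        st -= 2;                               // POP(0) 2: discard current and final values
        if (st != 0) throw new IllegalStateException("stack not restored");
        return seen.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(1, 3));         // prints: 1 2 3
    }
}
```

With m greater than n the body is never entered, which agrees with the informal description of the command.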

Chapter 8: Interpretation
Iterative Interpretation
  Iterative Interpretation of Machine Code
  Iterative Interpretation of Command Languages
  Iterative Interpretation of Simple Programming Languages
Recursive Interpretation
Case Study: The TAM Interpreter

Chapter 9: Conclusion The Programming Language Life Cycle
Design Specification Prototype Compilers Error Reporting Compile-time Error Reporting Run-time Error Reporting Efficiency Compile-time Efficiency Run-time Efficiency

Programming Language Lifecycle: Concepts
Concepts: Values, Types, Storage, Bindings, Abstractions
Advanced Concepts: Encapsulation, Polymorphism, Exceptions, Concurrency

Programming Language Lifecycle: Simplicity & Regularity
Strive for simplicity and regularity Simplicity: support only the concepts essential to the applications for which language is intended Regularity: should combine those concepts in a systematic way, avoiding restrictions that might surprise programmers or make their task more difficult

Design Principles
Type completeness: no operation should be arbitrarily restricted in the types of its operands
  Operations like assignment and parameter passing should, ideally, be applicable to all types
Abstraction: for each phrase that specifies some kind of computation, there should be a way to abstract that phrase and parameterize it
  It should be possible to abstract any expression and make it a function
Correspondence: for each form of declaration there should be a corresponding parameter mechanism
  Take a block with a constant definition and transform it into a procedure (or function) with a constant parameter

Programming Language Lifecycle
Design Specification Prototype Compilers Manuals, textbooks

Specification
Precise specification for language’s syntax and semantics must be written
Informal or formal or hybrid
              Informal           Formal
  Syntax      English phrases    BNF, EBNF
  Semantics   English phrases    Axiomatic method (based on mathematical logic)

Prototypes Cheap, low quality implementation
Highlights features of language that are hard to implement
Try out language
Interpreter might be a good prototype
Interpretive compiler
  From source to abstract machine code

Compile-Time Error Reporting
Rejecting ill-formed programs
Report location of each error with some explanation
Distinguish between the major categories of compile-time errors:
  Syntactic error: missing or unexpected characters or tokens
    Indicate what characters or tokens were expected
  Scope error: a violation of the language’s scope rules
    Indicate which identifier was declared twice, or used without declaration
  Type error: a violation of the language’s type rules
    Indicate which type rule was violated and/or what type was expected

Run-Time Error Reporting
Common run-time errors Arithmetic overflow Division by zero Out-of-range array indexing Can be detected only at run-time, because they depend on values computed at run-time

Final Exam Review
Final Exam is comprehensive in that:
  Essay questions will cover Chapters 5, 6, 7, 9
  Problem oriented questions require knowledge from the entire semester
Exam Structure
  Four questions
  Two essay questions: Discuss, Describe, Compare & contrast, Evaluate
  Two problems
    Develop code template for new language construct
    Determine identification table for given program
    Calculate size and address for given type(s)

Final Exam Review Chapter 5 – Contextual Analysis
Contextual analysis checks that the program conforms to the source language’s contextual constraints Scope rules Type rules Block Structure Monolithic Flat Nested Type Checking Literal Identifier Unary operator application Binary operator application Standard Environment

Final Exam Review Chapter 6 – Run-Time Organization
Key Issues Data representation Expression evaluation Storage allocation Routines Fundamental Principles of Data Representation Non-confusion: different values of a given type should have different representation Uniqueness: each value should always have same representation

Types Primitive types: cannot be decomposed Boolean Character Integer Records Disjoint unions Static arrays Dynamic arrays Recursive types For various types be able to determine size (storage required) and address (how to locate)

Expression Evaluation Stack machine Register machine Static storage allocation Global variables Stack storage allocation Local variables

Final Exam Review Chapter 7 – Code Generation
Translation of the source program to object code Dependent on source language and target machine Target Machines Registers, or stack, or both for intermediate results Instructions with zero, one, two, or three operands, or a mixture Single addressing mode, or many

Final Exam Review Chapter 7 – Code Generation
Code selection: which sequence of target machine instructions will be the object code for each phrase
Storage allocation: deciding the storage address of each variable in the source program
Register allocation: deciding which registers should be used to hold intermediate results during expression evaluation

Final Exam Review Chapter 9 – Programming Language Life-Cycle
Design Specification Prototype Compilers Manuals, textbooks

Final Exam Review Chapter 9 – Programming Language Life-Cycle
Strive for simplicity and regularity
Design Principles
  Type completeness: no operation should be arbitrarily restricted in the types of its operands
  Abstraction: for each phrase that specifies some kind of computation, there should be a way to abstract that phrase and parameterize it
  Correspondence: for each form of declaration there should be a corresponding parameter mechanism
Specifications
Prototype
Error Reporting
  Compile-time
  Run-time

Final Exam Review Structure of a Compiler
