Introduction Fan Wu Department of Computer Science and Engineering


1 Introduction
lec00-outline, April 23, 2017. Fan Wu, Department of Computer Science and Engineering, Shanghai Jiao Tong University.

2 Why study compiling?
Importance: Programs written in high-level languages have to be translated into binary code before they can execute. Compilers reduce the execution overhead of such programs and make high-performance computer architectures effective on users' programs. By minimizing execution overhead, compilers help promote the use of high-level languages. In fact, the performance of a computer system is so dependent on compiler technology that compilers are used as a tool in evaluating architectural concepts before a computer is built.
Influence: Language design and computer architecture (the influence is bi-directional). The techniques used in compiler design also apply to many other problems in computer science. Techniques used in a lexical analyzer reappear in text editors, information retrieval systems, and pattern-recognition programs (e.g., spelling and grammar checking in Word). Techniques used in a parser reappear in query-processing systems such as SQL. Much software with a complex front end needs compiler techniques; a symbolic equation solver, for example, must parse the equation given as input. Many of the techniques used in compiler design are also used in Natural Language Processing (NLP) systems, and in debugging and finding security holes in code.

3 Compiler Concept
A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language, emitting error messages along the way. The source program is normally written in a high-level programming language; the target program is normally machine code (often a relocatable object file). Programming languages are notations for describing computations to people and to machines. The world as we know it depends on programming languages, because all the software running on all the computers was written in some programming language. But before a program can be run, it first must be translated into a form in which it can be executed by a computer. The software systems that do this translation are called compilers. If the target program is an executable machine-language program, it can then be called to process inputs and produce outputs. Both compilers and interpreters are language processors. What is the difference between a compiler and an interpreter? (Page 2) The Java language processor is a hybrid of the two. (Page 3)

4 Interpreter
An interpreter directly executes the operations specified in the source program on inputs supplied by the user, producing output and error messages as it goes.

5 Programming Languages
Compiled languages: Fortran, Pascal, C, C++, C#, Delphi, Visual Basic, …
Interpreted languages: BASIC, Perl, PHP, Ruby, TCL, MATLAB, … (although the original version of BASIC, Dartmouth BASIC, was compiled, as are many modern BASICs)
Jointly compiled and interpreted languages: Java, Python, … A Java source program is first compiled into bytecodes; the bytecodes are then interpreted by a virtual machine.

6 Compiler vs. Interpreter
Preprocessing: Compilers do extensive preprocessing, while interpreters run programs "as is", with little or no preprocessing.
Efficiency: The target program produced by a compiler is usually much faster than interpreting the source code.
Debugging: An interpreter can give better error diagnostics than a compiler. What happens if a compiler works in debug mode? In order to debug a program effectively, you need to generate debugging information when you compile it. This debugging information is stored in the object file; it describes the data type of each variable or function and the correspondence between source line numbers and addresses in the executable code.

7 Compiler Structure
Front end (analysis): language specific; translates the source language into an intermediate language, recording information in a symbol table.
Back end (synthesis): machine specific; translates the intermediate language into the target language.
This split gives separation of concerns and supports retargeting. Retargeting is an attribute of software development tools that have been specifically designed to generate code for more than one computing platform, for example generating code for a SPARC machine instead of an x86 machine.

8 Two Main Phases
Analysis phase: breaks up a source program into constituent pieces and produces an internal representation of it called intermediate code.
Synthesis phase: translates the intermediate code into the target program.
If the analysis part detects that the source program is either syntactically ill formed or semantically unsound, then it must provide informative messages, so the user can take corrective action. The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part. The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table.

9 Phases of Compilation
Compilers work in a sequence of phases. Each phase transforms the source program from one representation into another; all phases use the symbol table to store information about the entire source program. The analysis (front-end) phases are the lexical analyzer, syntax analyzer, semantic analyzer, and intermediate code generator; the synthesis (back-end) phases are the code optimizer and code generator.
Front end: The front end analyzes the source code to build an internal representation of the program, called the intermediate representation or IR. It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope. This is done over several phases, which include some of the following:
Line reconstruction. Languages which strop their keywords or allow arbitrary spaces within identifiers require a phase before parsing, which converts the input character sequence to a canonical form ready for the parser. The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase. Atlas Autocode and Imp (and some implementations of ALGOL and Coral 66) are examples of stropped languages whose compilers would have a line reconstruction phase.
Lexical analysis breaks the source code text into small pieces called tokens. Each token is a single atomic unit of the language, for instance a keyword, identifier or symbol name. The token syntax is typically a regular language, so a finite state automaton constructed from a regular expression can be used to recognize it. This phase is also called lexing or scanning, and the software doing lexical analysis is called a lexical analyzer or scanner.
Preprocessing. Some languages, e.g., C, require a preprocessing phase which supports macro substitution and conditional compilation. Typically the preprocessing phase occurs before syntactic or semantic analysis; e.g., in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages, such as Scheme, support macro substitution based on syntactic forms.
Syntax analysis involves parsing the token sequence to identify the syntactic structure of the program. This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar which define the language's syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler.
Semantic analysis is the phase in which the compiler adds semantic information to the parse tree and builds the symbol table. This phase performs semantic checks such as type checking (checking for type errors), object binding (associating variable and function references with their definitions), and definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings. Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.
Back end: The term back end is sometimes confused with code generator because of the overlapping functionality of generating assembly code. Some literature uses middle end to distinguish the generic analysis and optimization phases in the back end from the machine-dependent code generators. The main phases of the back end include the following:
Analysis: the gathering of program information from the intermediate representation derived from the input. Typical analyses are data-flow analysis to build use-define chains, dependence analysis, alias analysis, pointer analysis, escape analysis, etc. Accurate analysis is the basis for any compiler optimization. The call graph and control-flow graph are usually also built during the analysis phase.
Optimization: the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms. Popular optimizations are inline expansion, dead code elimination, constant propagation, loop transformation, register allocation and even automatic parallelization.
Code generation: the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory, and the selection and scheduling of appropriate machine instructions along with their associated addressing modes (see also the Sethi-Ullman algorithm). Debug data may also need to be generated to facilitate debugging.
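The optimization step above can be made concrete with one classic transformation. Below is a minimal Python sketch of constant folding with simple constant propagation over a made-up three-address-code form; the quadruple layout, the `fold_constants` name, and the temporaries `t1`/`t2` are illustrative assumptions, not the lecture's IR.

```python
# Fold and propagate constants in a tiny three-address-code list.
# Each instruction is a quadruple (dst, op, arg1, arg2); this IR layout
# is invented for illustration and supports only + and *.
def fold_constants(code):
    consts = {}   # names whose values are already known constants
    out = []      # instructions that must remain at run time
    for dst, op, a, b in code:
        a = consts.get(a, a)          # propagate known constants into operands
        b = consts.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            # both operands known: evaluate at compile time
            consts[dst] = a + b if op == "+" else a * b
        else:
            out.append((dst, op, a, b))
    return out, consts
```

With input t1 = 2 + 3; t2 = t1 * 4, every instruction folds away and t2 is known to be 20 before the program ever runs.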

10 A Model of a Compiler Front End
The lexical analyzer reads the source program character by character and returns the tokens of the source program. The parser creates the tree-like syntactic structure of the given program. The intermediate-code generator translates the syntax tree into three-address code. This slide shows a model of a compiler front end. We begin with the parser.

11 Lexical Analysis

12 Lexical Analysis
The lexical analyzer reads the source program character by character and returns the tokens of the source program as pairs <token-name, attribute-value>. A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters, and so on). The first step is to recognize words. For example, the constant 60 becomes the token <NUM, 60>.
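The character-by-character scan can be sketched with one regular-expression pattern per token class. This is a minimal illustration, not the lecture's scanner; the token names (NUM, ID, OP, SKIP) and patterns are assumptions.

```python
import re

# One (name, pattern) pair per token class; order matters, since the
# first matching pattern wins.
TOKEN_SPEC = [
    ("NUM",  r"\d+"),            # integer constants
    ("ID",   r"[A-Za-z_]\w*"),   # identifiers
    ("OP",   r"[+\-*/=]"),       # single-character operators
    ("SKIP", r"[ \t\n]+"),       # blanks, tabs, newlines: discarded
]

def tokenize(source):
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "SKIP":       # white space never reaches the parser
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError("unexpected character: " + source[pos])
    return tokens
```

For the input "count = 60" this yields the pairs ("ID", "count"), ("OP", "="), ("NUM", "60"), i.e. the <token-name, attribute-value> form described above.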

13 White Space Removal
There are no blanks, tabs, newlines, or comments in the grammar, so the lexical analyzer skips white space rather than passing it on to the parser.

14 Constants
When a sequence of digits appears in the input stream, the lexical analyzer passes to the parser a token consisting of the terminal num along with an integer-valued attribute computed from the digits. For example, 31 + 28 + 59 becomes <num, 31> <+> <num, 28> <+> <num, 59>.

15 Keywords and Identifiers
Keywords: fixed character strings used as punctuation marks or to identify constructs.
Identifiers: a character string forms an identifier only if it is not a keyword.
Keywords generally satisfy the rules for forming identifiers, so a mechanism is needed for deciding when a lexeme forms a keyword and when it forms an identifier. The problem is easier to resolve if keywords are reserved, i.e., if they cannot be used as identifiers; then a character string forms an identifier only if it is not a keyword. The lexical analyzer here solves two problems by using a table to hold character strings:
• Single representation. A string table can insulate the rest of the compiler from the representation of strings, since the phases of the compiler can work with references or pointers to the string in the table. References can also be manipulated more efficiently than the strings themselves.
• Reserved words. Reserved words can be implemented by initializing the string table with the reserved strings and their tokens. When the lexical analyzer reads a string or lexeme that could form an identifier, it first checks whether the lexeme is in the string table. If so, it returns the token from the table; otherwise, it returns a token with terminal id.
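The reserved-word lookup can be sketched in a few lines. The table contents and the `classify` name are illustrative assumptions; the point is only the decision rule: a hit in the pre-loaded table means keyword, a miss means the generic terminal id.

```python
# String table pre-loaded with the reserved words and their tokens.
KEYWORDS = {"if": "IF", "else": "ELSE", "while": "WHILE", "true": "TRUE"}

def classify(lexeme, string_table=KEYWORDS):
    # Reserved words were inserted first, so a table hit means "keyword";
    # any other identifier-shaped lexeme gets the generic terminal id.
    if lexeme in string_table:
        return (string_table[lexeme], lexeme)
    return ("ID", lexeme)
```

So classify("if") yields the keyword token while classify("ifx") falls through to an ordinary identifier, exactly the distinction the slide describes.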

16 Lexical Analysis Cont'd
The lexical analyzer puts information about identifiers into the symbol table. Regular expressions are used to describe tokens (lexical constructs), and a (deterministic) finite state automaton can be used in the implementation of a lexical analyzer.

17 Symbol Table

18 Symbol Table
Symbol tables are data structures used by compilers to hold information about source-program constructs. For each identifier, there is an entry in the symbol table containing its information. Symbol tables need to support multiple declarations of the same identifier: one symbol table per scope (of declaration). For example:
{ int x; char y; { bool y; x; y; } x; y; }
Entries in the symbol table contain information about an identifier such as its character string (or lexeme), its type, its position in storage, and any other relevant information. The following delimits the scopes of x and y, as the compiler must be able to do: the outer symbol table holds x (int) and y (char); the inner symbol table holds y (bool).
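The per-scope tables can be sketched as symbol tables chained to their enclosing scope, mirroring the { int x; char y; { bool y; … } … } example above. The class and method names here are my own, not from the lecture.

```python
# One symbol table per scope, chained to the enclosing scope's table.
class SymbolTable:
    def __init__(self, parent=None):
        self.entries = {}
        self.parent = parent          # enclosing scope, or None at top level

    def declare(self, name, typ):
        self.entries[name] = typ

    def lookup(self, name):
        table = self
        while table is not None:      # search outward through enclosing scopes
            if name in table.entries:
                return table.entries[name]
            table = table.parent
        raise KeyError(name)

# { int x; char y; { bool y; ... } ... }
outer = SymbolTable()
outer.declare("x", "int")
outer.declare("y", "char")
inner = SymbolTable(parent=outer)
inner.declare("y", "bool")
```

Inside the inner block, y resolves to bool (the innermost declaration wins) while x is still found in the outer scope; after the inner block, y is char again.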

19 Symbol Table
A symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name; e.g., entries mapping the identifiers position, initial, and rate to their respective attributes. An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name. These attributes may provide information about the storage allocated for a name, its type, its scope, and, in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument, and the type returned.

20 Parsing
A syntax analyzer (parser) creates the syntactic structure (generally a parse tree) of the given program. Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar. We now study, given a context-free grammar, how to parse a string of terminals provided by a lexical analyzer.

21 Syntax Analysis
A syntax analyzer (parser) creates the syntactic structure (generally a parse tree) of the given program. A parse tree describes a syntactic structure: each interior node represents an operation, and the children of the node represent the arguments of the operation. Once words are understood, the next step is to understand sentence structure. A typical representation is a syntax tree in which each interior node represents an operation and the children of the node represent its arguments. This tree shows the order in which the operations in an assignment are to be performed.

22 Syntax (CFG)
The syntax of a language is specified by a context-free grammar (CFG). The rules in a CFG are mostly recursive. A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not; if it does, the syntax analyzer creates a parse tree for the given program. We use BNF (Backus-Naur Form) to specify a CFG. Example:
assgstmt -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression

23 Syntax Definition
A context-free grammar (CFG) is used to specify the syntax of a formal language (for example, a programming language like C or Java). A grammar describes the structure (usually hierarchical) of programming languages. Example: in Java, an IF statement should fit the form
statement -> if ( expression ) statement else statement
Note the recursive nature of statement. Such a rule is called a production; the arrow can be read as "can have the form" or "can be". A production has a head (left side) and a body (right side).

24 Definition of CFG
A CFG has four components:
A set of terminal symbols (tokens): the elementary symbols of the language defined by the grammar.
A set of non-terminals (syntactic variables): each represents a set of strings of terminals.
A set of productions: non-terminal -> a sequence of terminals and/or non-terminals (head/left side -> body/right side).
A designation of one of the non-terminals as the start symbol.

25 A Grammar Example
A list of digits separated by plus or minus signs; the grammar accepts strings such as 9-5+2, 3-1, or 7.
0, 1, …, 9, +, - are the terminal symbols; list and digit are non-terminals; every "line" is a production; list is the start symbol.
Grouping: list → list + digit | list – digit | digit
For notational convenience, productions with the same non-terminal as the head can have their bodies grouped, with the alternative bodies separated by the symbol |, which we read as "or". The ten productions for the non-terminal digit allow it to stand for any of the terminals 0, 1, …, 9. By production (3), a single digit by itself is a list. Productions (1) and (2) express the rule that any list followed by a plus or minus sign and then another digit makes up a new list.

26 Derivations
A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the body of one of its productions. The language defined by the grammar is the set of terminal strings that can be derived from the start symbol.
Example: derivation of 9-5+2.
9 is a list, since 9 is a digit.
9-5 is a list, since 9 is a list and 5 is a digit.
9-5+2 is a list, since 9-5 is a list and 2 is a digit.
The derivation proceeds: list ⇒ list + digit ⇒ list - digit + digit ⇒ digit - digit + digit ⇒ 9-5+2.

27 Parse Trees
A parse tree shows how the start symbol of a grammar derives a string in the language. The result of the parsing process, i.e., the derivation of a string, can be represented by a parse tree, which is generated during parsing. If non-terminal A has a production A → XYZ, then a parse tree may have an interior node labeled A with three children labeled X, Y, and Z, from left to right.

28 Parse Tree Properties
The root is labeled by the start symbol.
Each leaf is labeled by a terminal or by ε.
Each interior node is labeled by a non-terminal.
If A is the non-terminal labeling some interior node and X1, X2, …, Xn are the labels of the children of that node from left to right, then there must be a production A → X1X2 · · · Xn.
Here, ε stands for the empty string of symbols, and X1, X2, …, Xn each stand for a symbol that is either a terminal or a non-terminal. One very important point: the children of a node are ordered from left to right, so any tree imparts a natural left-to-right order to its leaves.

29 Parse Tree for 9-5+2
The root is labeled list. The children of the root are labeled, from left to right, list, +, and digit. The left child of the root is similar to the root, with a child labeled - instead of +. The three nodes labeled digit each have one child that is labeled by a digit.

30 Ambiguity
A grammar can have more than one parse tree generating a given string of terminals.
Unambiguous: list → list + digit | list – digit | digit, with digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Ambiguous: string → string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Under the ambiguous grammar, 9-5+2 has two parse trees, corresponding to (9-5)+2 = 6 and 9-(5+2) = 2.
A parse tree is a good intermediate representation, but there may be a problem when considering the parse tree of a string according to a grammar: ambiguity. Suppose we used a single non-terminal string and did not distinguish between digits and lists, as above. Then an expression like 9-5+2 has more than one parse tree with this grammar. The two trees correspond to the two ways of parenthesizing the expression: (9-5)+2 and 9-(5+2). The second parenthesization gives the expression the unexpected value 2 rather than the customary value 6. The list/digit grammar does not permit this interpretation. Consider: how can a grammar be made unambiguous?
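The two parse trees are not just a formal nuisance: they denote different values. In the sketch below (my own representation, with nested tuples standing in for parse trees), evaluating each of the two trees for 9-5+2 gives the two results quoted above.

```python
# Evaluate a parse tree for the ambiguous grammar
#   string -> string + string | string - string | digit
# A tree is either an int (a digit) or a tuple (left, op, right).
def evaluate(tree):
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) - evaluate(right)

left_assoc  = ((9, "-", 5), "+", 2)   # the (9-5)+2 tree, conventional reading
right_assoc = (9, "-", (5, "+", 2))   # the 9-(5+2) tree, also derivable
```

Evaluating left_assoc gives 6; evaluating right_assoc gives 2, so the ambiguous grammar genuinely leaves the meaning of 9-5+2 undefined.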

31 Eliminating Ambiguity
Operator associativity: in most programming languages, arithmetic operators are left-associative. Example: 9+5-2 = (9+5)-2. Exception: the assignment operator = is right-associative: a=b=c is equivalent to a=(b=c).
Operator precedence: if an operator has higher precedence, it binds to its operands first. Example: * has higher precedence than +, therefore 9+5*2 = 9+(5*2).
When there is more than one kind of operator, we also need rules for the relative precedence of operators. The grammar list → list + digit | list – digit | digit, digit → 0 | 1 | … is left-associative. What would a right-associative grammar look like?

32 Parsing
Parsing is the process of determining how a string of terminals can be generated by a grammar; if the string cannot be derived from the start symbol, the parser reports syntax errors within the string. There are two classes:
Top-down: construction of the parse tree starts at the root and proceeds towards the leaves.
Bottom-up: construction of the parse tree starts at the leaves and proceeds towards the root.
We now study how a simple parser does the job. Top-down parsing is easy to understand, and its popularity is due to the fact that efficient parsers can be constructed more easily by hand using top-down methods; so here we study an instance of top-down parsing. Bottom-up parsing, however, has wider usage: it can handle a larger class of grammars and translation schemes, so software tools for generating parsers directly from grammars often use bottom-up methods.

33 Top-Down Parsing The top-down construction of a parse tree is done by starting from the root, and repeatedly performing the following two steps. At node N, labeled with non-terminal A, select the proper production of A and construct children at N for the symbols in the production body. Find the next node at which a subtree is to be constructed, typically the leftmost unexpanded non-terminal of the tree.

34 Top-Down Parsing
The lookahead symbol is the current terminal being scanned in the input. Problem: it is easy for human beings to identify the proper production to use in top-down parsing, but a computer has to try each production until the right one is found. If the first production picked turns out to be unsuitable, we have to backtrack and try another production. This is not efficient, because the lookahead cursor has to be rolled back and the subtree has to be reconstructed.

35 Predictive Parsing
Recursive-descent parsing: a top-down method of syntax analysis in which a set of recursive procedures is used to process the input; one procedure is associated with each non-terminal of the grammar.
Predictive parsing: a simple form of recursive-descent parsing in which the lookahead symbol unambiguously determines the flow of control, based on the first terminal(s) each production body can derive.
Here we study this simple method of avoiding backtracking. The sequence of procedure calls during the analysis of an input string implicitly defines a parse tree for the input, and can be used to build an explicit parse tree, if desired.

36 Procedure for stmt
Necessary condition to use predictive parsing: no conflict among the first symbols of the bodies for the same head.
Procedure stmt executes code corresponding to the production. In the code for the production body, each terminal is matched with the lookahead symbol, and each non-terminal leads to a call of its procedure. Procedure match(t) compares its argument t with the lookahead symbol and advances to the next input terminal if they match. Thus match changes the value of the variable lookahead, a global variable that holds the currently scanned input terminal.
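The match/lookahead machinery can be sketched as follows. The toy grammar here, stmt → "if" "(" expr ")" stmt | "other" with expr → "id", is a stand-in chosen so that the first symbol of each body selects the production; it is not the stmt grammar from the slide.

```python
tokens = []          # remaining input terminals; filled by parse()
lookahead = None     # the currently scanned input terminal

def advance():
    global lookahead
    lookahead = tokens.pop(0) if tokens else None

def match(t):
    # Compare t with the lookahead symbol; on a match, advance the input.
    if lookahead == t:
        advance()
    else:
        raise SyntaxError(f"expected {t}, found {lookahead}")

def stmt():
    if lookahead == "if":        # the first terminal selects the production
        match("if"); match("("); expr(); match(")"); stmt()
    else:
        match("other")

def expr():
    match("id")

def parse(token_list):
    global tokens
    tokens = list(token_list)
    advance()
    stmt()
    return lookahead is None     # True iff the whole input was consumed
```

Because the two bodies for stmt begin with different terminals ("if" versus "other"), one token of lookahead is always enough and no backtracking ever occurs.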

37 Left Recursion Elimination
A production is left-recursive when the leftmost symbol of the body is the same as the non-terminal at the head. Such a production can be eliminated by rewriting it.
Predictive parsing relies on information about the first symbols that can be generated by a production body, so it is possible for a recursive-descent parser to loop forever. A problem arises with "left-recursive" productions like expr → expr + term, where the leftmost symbol of the body is the same as the non-terminal at the head of the production. Suppose the procedure for expr decides to apply this production. The body begins with expr, so the procedure for expr is called recursively. Since the lookahead symbol changes only when a terminal in the body is matched, no change to the input takes place between recursive calls of expr. As a result, the second call to expr does exactly what the first call did, which means a third call to expr, and so on, forever.
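For example, expr → expr + term | expr - term | term can be rewritten as expr → term rest, rest → + term rest | - term rest | ε, and the right-recursive rest then becomes a simple loop. A sketch under simplifying assumptions (single-digit terms, tokens given as a list of strings; the function names are my own):

```python
# expr -> term rest; rest -> + term rest | - term rest | (empty)
def parse_expr(tokens):
    pos = parse_term(tokens, 0)                    # expr -> term ...
    while pos < len(tokens) and tokens[pos] in "+-":
        pos = parse_term(tokens, pos + 1)          # rest -> op term rest
    return pos == len(tokens)                      # True iff all input used

def parse_term(tokens, pos):
    if pos < len(tokens) and tokens[pos].isdigit():
        return pos + 1
    raise SyntaxError("digit expected")
```

The loop consumes a terminal (+ or -) on every iteration, so the input always advances and the infinite recursion described above cannot occur.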

38 Syntax Analyzer vs. Lexical Analyzer
Both of them do similar things, but:
Granularity: the lexical analyzer works on characters to recognize the smallest meaningful units (tokens) in a source program; the syntax analyzer works on those tokens to recognize meaningful structures in the programming language.
Recursion: the lexical analyzer deals with simple, non-recursive constructs of the language; the syntax analyzer deals with recursive constructs. That is, regular expressions (token patterns) are not used to define other tokens, whereas non-terminal symbols can be used to define other non-terminal symbols.

39 Semantic Analysis
The semantic analyzer:
adds semantic information to the parse tree (syntax-directed translation)
checks the source program for semantic errors
collects type information for code generation
type checking: checks whether each operator has matching operands
coercion: type conversion
Once sentence structure is understood, we can try to understand "meaning". The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation. An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used to index an array.

40 Semantic Analysis
A semantic analyzer checks the source program for semantic errors and collects the type information for code generation. Type checking is an important part of semantic analysis. (Syntax tree → semantic tree.)

41 Syntax-Directed Translation
Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar; an example is translating an infix expression into a postfix expression. Techniques: attributes and translation schemes. Based on the parse tree, we can do syntax-directed translation.

42 Postfix Notation
Definition:
If E is a variable or constant, E → E.
If E is an expression of the form E1 op E2, then E1 op E2 → E'1 E'2 op.
If E is a parenthesized expression of the form (E1), then (E1) → E'1.
Examples: 9-5+2 → 95-2+ and 9-(5+2) → 952+-.
No parentheses are needed in postfix notation, because the position and number of arguments of the operators permit only one decoding of a postfix expression. The "trick" is to repeatedly scan the postfix string from the left until you find an operator, then look to the left for the proper number of operands and group this operator with its operands. Evaluate the operator on the operands, and replace them by the result. Then repeat the process, continuing to the right and searching for another operator.
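The left-to-right "trick" above is exactly a stack evaluator: push digits, and on seeing an operator, pop its two operands and push the result. A sketch for the single-digit, +/- case (the function name is my own):

```python
# Evaluate a postfix string of single digits and the operators + and -.
def eval_postfix(expr):
    stack = []
    for ch in expr:
        if ch.isdigit():
            stack.append(int(ch))
        else:
            right = stack.pop()       # operands come off in reverse order
            left = stack.pop()
            stack.append(left + right if ch == "+" else left - right)
    return stack.pop()
```

Running it on the two examples from the slide: "95-2+" evaluates to 6, i.e. (9-5)+2, and "952+-" evaluates to 2, i.e. 9-(5+2), confirming that each postfix string has exactly one decoding.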

43 Attributes
A syntax-directed definition:
associates attributes with non-terminals and terminals in a grammar
attaches semantic rules to the productions of the grammar; these rules describe how the attributes are computed
An attribute is said to be synthesized if its value at a parse-tree node is determined from the attribute values of its children and of the node itself. Using attributes is one way of doing syntax-directed translation.

44 Semantic Rules for Infix to Postfix
Annotated parse tree for 9-5+2 → 95-2+, based on the syntax-directed definition for translating expressions consisting of digits separated by plus or minus signs into postfix notation. Each non-terminal has a string-valued attribute t that represents the postfix notation for the expression generated by that non-terminal in a parse tree. The symbol || in the semantic rules is the operator for string concatenation. The postfix form of a digit is the digit itself. When the production expr → term is applied, the value of term.t becomes the value of expr.t. The production expr → expr1 + term derives an expression containing a plus operator; the left operand of the plus operator is given by expr1 and the right operand by term. The semantic rule expr.t = expr1.t || term.t || '+' associated with this production constructs the value of attribute expr.t by concatenating the postfix forms expr1.t and term.t of the left and right operands, respectively, and then appending the plus sign. This rule is a formalization of the definition of "postfix expression".
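The synthesized attribute t can be sketched as a recursive function whose return value plays the role of the attribute; each node's t is built from the t-values of its children. Tuples stand in for parse-tree nodes here, and Python's + replaces the slide's || concatenation (names are my own).

```python
# Compute the synthesized attribute t (the postfix string) bottom-up.
# A tree is either a digit string (term -> digit) or a tuple
# (left, op, right) for the production expr -> expr1 op term.
def expr_t(tree):
    if isinstance(tree, str):
        return tree                           # term.t = the digit itself
    left, op, right = tree
    # expr.t = expr1.t || term.t || op
    return expr_t(left) + expr_t(right) + op

nine_five_plus_two = (("9", "-", "5"), "+", "2")   # parse tree of 9-5+2
```

Calling expr_t on the tree for 9-5+2 yields the string "95-2+", matching the annotated parse tree.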

45 Translation Schemes
A syntax-directed translation scheme is a notation for specifying a translation by attaching program fragments to productions in a grammar. The program fragments are called semantic actions. We have seen the use of attributes for syntax-directed translation; we now consider an alternative approach, executing program fragments. The position at which an action is to be executed is shown by enclosing it between curly braces and writing it within the production body. When drawing a parse tree for a translation scheme, we indicate an action by constructing an extra child for it, connected by a dashed line to the node that corresponds to the head of the production. The node for a semantic action has no children, so the action is performed when that node is first seen.

46 A Translation Scheme 9-5+2 → 95-2+ Parse tree Translation scheme The parse tree has print statements at extra leaves, which are attached by dashed lines to interior nodes of the parse tree. The underlying grammar generates expressions consisting of digits separated by plus and minus signs. The actions embedded in the production bodies translate such expressions into postfix notation, provided we perform a left-to-right depth-first traversal of the tree and execute each print statement when we visit its leaf. The root represents the first production of the translation scheme. In a postorder traversal, we first perform all the actions in the leftmost subtree of the root, for the left operand, also labeled expr like the root. We then visit the leaf +, at which there is no action. We next perform the actions in the subtree for the right operand term and, finally, the semantic action { print('+') } at the extra node. Since the productions for term have only a digit on the right side, that digit is printed by the actions for those productions. No output is necessary for the production expr → term, and only the operator needs to be printed in the action for each of the first two productions. When executed during a postorder traversal of the parse tree, the actions print 95-2+.
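The same traversal can be sketched in code. This is an illustrative sketch, not the slides' implementation: actions append to an output list instead of printing directly, so the incremental emission is easy to inspect, and the parse tree is the same hand-built tuple shape as before.

```python
# Sketch of the translation scheme: semantic actions (the appends,
# standing in for print statements) execute during a left-to-right
# depth-first traversal, emitting the postfix translation incrementally.

out = []                       # collects output in emission order

def expr(node):
    if node[0] == "term":
        term(node)
        return
    _, left, op, right = node  # expr -> expr op term { print(op) }
    expr(left)                 # actions in the left operand's subtree first
    term(right)                # then the right operand's subtree
    out.append(op)             # finally the action at the extra leaf

def term(node):
    out.append(node[1])        # term -> digit { print(digit) }

tree = ("expr",
        ("expr", ("term", "9"), "-", ("term", "5")),
        "+",
        ("term", "2"))
expr(tree)
print("".join(out))  # 95-2+
```

Unlike the attribute version, nothing is stored at the nodes; the translation exists only in the order the actions fire.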

47 Attribute vs. Translation Scheme
A syntax-directed definition attaches strings as attributes to the nodes in the parse tree, whereas a syntax-directed translation scheme prints the translation incrementally through semantic actions, which avoids the space cost of storing the attribute strings.

48 Parsing Techniques Depending on how the parse tree is created, there are different parsing techniques. These techniques are categorized into two groups: top-down parsing and bottom-up parsing. Top-Down Parsing: construction of the parse tree starts at the root and proceeds towards the leaves. Efficient top-down parsers can be easily constructed by hand: Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing). Bottom-Up Parsing: construction of the parse tree starts at the leaves and proceeds towards the root. Efficient bottom-up parsers are normally created with the help of software tools. Bottom-up parsing is also known as shift-reduce parsing: Operator-Precedence Parsing is simple and restrictive but easy to implement; LR Parsing is a more general form of shift-reduce parsing (LR, SLR, LALR). (Chapter 4)

49 A Simple Translator Semantic actions embedded in the productions are simply carried along in the transformation, as if they were terminals. Grammar: a list of digits separated by plus or minus signs.

50 Translation of 9-5+2 to 95-2+ Left-recursion eliminated

51 Procedures for Simple Translator
If the print statements are replaced by code that creates tree nodes, the same procedures build a syntax tree instead. Why use a while loop in the rest() function? Because rest appears again at the end of its own production body, the tail-recursive call can be replaced by a loop.
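The procedures can be sketched as follows. This is an illustrative sketch assuming the left-recursion-eliminated grammar expr → term rest, rest → + term { print('+') } rest | - term { print('-') } rest | ε, term → digit; the helper names (translate, lookahead) are invented for the sketch.

```python
# Sketch of the simple translator's procedures after left-recursion
# elimination. rest() would end with a recursive call to itself, so the
# tail call is replaced by a while loop, as the slide asks.

def translate(source):
    out = []
    pos = 0

    def lookahead():
        return source[pos] if pos < len(source) else None

    def term():
        nonlocal pos
        ch = lookahead()
        assert ch is not None and ch.isdigit(), "digit expected"
        out.append(ch)                    # action: print the digit
        pos += 1

    def rest():
        nonlocal pos
        # rest -> + term {print '+'} rest | - term {print '-'} rest | empty
        while lookahead() in ("+", "-"):  # loop replaces the tail call
            op = lookahead()
            pos += 1
            term()
            out.append(op)                # action: print the operator

    term()        # expr -> term rest
    rest()
    return "".join(out)

print(translate("9-5+2"))  # 95-2+
```

Each iteration of the while loop plays the role of one recursive invocation of rest(), so no stack frame is consumed per operator.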

52 (Abstract) Syntax Trees
In an (abstract) syntax tree for an expression, each interior node represents an operator, and the children of the node represent the operands of that operator. In the syntax tree, interior nodes represent programming constructs; in the parse tree, interior nodes represent nonterminals. Syntax tree for 9-5+2 In the following, we first study a translation scheme that constructs syntax trees, and later study how the scheme can be modified to emit three-address code.
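As noted on the previous slide, replacing the print actions with node construction yields a syntax tree. A minimal sketch, with a Node class invented for the illustration (a postfix method is added only so the result can be checked against the earlier example):

```python
# Sketch of an (abstract) syntax tree: interior nodes are operators,
# leaves are operands. The class is illustrative, not from the slides.

class Node:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

    def postfix(self):
        if self.left is None:         # a leaf: the digit itself
            return self.op
        # operands first, then the operator, as in postfix notation
        return self.left.postfix() + self.right.postfix() + self.op

# Syntax tree for 9-5+2: '+' at the root, '-' below it -- three interior
# levels of the parse tree collapse into two operator nodes.
tree = Node("+", Node("-", Node("9"), Node("5")), Node("2"))
print(tree.postfix())  # 95-2+
```

The contrast with a parse tree is visible in the shape: there are no nodes for nonterminals like expr or term, only for the constructs themselves.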

53 Syntax vs. Semantics The syntax of a programming language describes the proper form of its programs. The semantics of the language defines what its programs mean: what each program does when it executes.

54 Intermediate Code Generation
Here, we consider intermediate representations for expressions and statements, and give examples of how to produce such representations.

55 Intermediate Code Generation
A compiler may produce explicit intermediate code representing the source program. Intermediate code is generally machine (architecture) independent, but its level is close to that of machine code. Ex: three-address code, x = y op z. This form of intermediate code takes its name from instructions of the form x = y op z, where op is a binary operator, y and z are the addresses of the operands, and x is the address of the result of the operation. A three-address instruction carries out at most one operation, typically a computation, a comparison, or a branch. There are several points worth noting about three-address instructions. First, each three-address assignment instruction has at most one operator on the right side; thus, these instructions fix the order in which operations are to be done (for example, a multiplication is emitted before an addition when the source program requires it). Second, the compiler must generate a temporary name to hold the value computed by a three-address instruction. Third, some "three-address instructions", such as copies, have fewer than three operands.
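The three points can be made concrete with a small emitter. This is a sketch only: the source expression a = b + c * d and the helper names (new_temp, emit) are invented for the illustration.

```python
# Sketch of the three properties of three-address code: one operator
# per instruction, a fixed evaluation order, and compiler-generated
# temporaries. Expression and helper names are invented for the sketch.

temp_count = 0
code = []

def new_temp():
    """Generate a fresh temporary name (second point above)."""
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def emit(dst, y, op, z):
    """Emit one instruction with at most one operator (first point)."""
    code.append(f"{dst} = {y} {op} {z}")
    return dst

# a = b + c * d: the multiplication is emitted first, fixing the
# evaluation order (third point: the copy has fewer than three operands).
t1 = emit(new_temp(), "c", "*", "d")
t2 = emit(new_temp(), "b", "+", t1)
code.append(f"a = {t2}")

print("\n".join(code))
# t1 = c * d
# t2 = b + t1
# a = t2
```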

56 Intermediate Code Generation
The front end of a compiler constructs an intermediate representation of the source program, from which the back end generates the target program. Two kinds of intermediate representations: Tree: parse trees and (abstract) syntax trees; Linear representation: three-address code.

57 Syntax Trees For Statement
stmt -> while ( expr ) stmt1  { stmt.n = new While(expr.n, stmt1.n) } n is a node in the syntax tree. For each statement construct, we define an operator in the abstract syntax. For constructs that begin with a keyword, we use the keyword for the operator.

58 Syntax Trees For Expressions

59 Static Checking Done by a compiler front end
To check that the program follows the syntactic and semantic rules: syntactic checking and type checking. In addition to creating an intermediate representation, a compiler front end checks that the source program follows the syntactic and semantic rules of the source language. This checking is called static checking. Static checking assures that certain kinds of programming errors, including type mismatches, are detected and reported during compilation. 1. Syntactic checking: constraints such as an identifier being declared at most once in a scope, or a break statement having an enclosing loop or switch statement, are syntactic, although they are not encoded in, or enforced by, the grammar used for parsing. L-values and R-values: static checking must assure that the left side of an assignment is a variable denoting a location where the computed value is to be stored. 2. Type checking: the type rules of a language assure that an operator or function is applied to the right number and type of operands. If conversion between types is necessary, e.g., when an integer is added to a float, the type checker can insert an operator into the syntax tree to represent that conversion. We discuss type conversion using the common term "coercion". When does coercion usually occur?

60 Three-Address Codes Three-address code is a sequence of instructions of the form x = y op z Arrays are handled by using the following two variants of instructions: x [ y ] = z and x = y [ z ] Instructions for control flow: ifFalse x goto L, ifTrue x goto L, goto L Instruction for copying a value: x = y Once syntax trees are constructed, we can traverse them to generate three-address code. Specifically, we show how to produce three-address code x = y op z, where x, y, and z are names, constants, or compiler-generated temporaries, and op stands for an operator. Three-address instructions are executed sequentially unless a jump instruction is used.

61 Translation of Statements
Use jump instructions to implement the flow of control through the statement. The translation of if expr then stmt1

62 Translation of Statements
The constructor If creates syntax-tree nodes for if-statements. It is called with two parameters, an expression node x and a statement node y, which it saves as attributes E and S. The constructor also assigns to attribute after a unique new label, obtained by calling function newlabel(). Once the entire syntax tree for a source program is constructed, the function gen() is called at the root of the syntax tree. The pseudo-code for function gen() of class If is representative: it calls E.rvalue() to translate the expression E (the boolean-valued expression that is part of the if-statement) and saves the result node returned by E. Function gen() then emits a conditional jump and calls S.gen() to translate the substatement S.
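The pattern described above can be sketched in Python. The class and method names (If, gen, rvalue, new_label) follow the slide's pseudo-code, but the Expr and Stmt stand-ins and all details are illustrative assumptions, not the course's actual implementation.

```python
# Sketch of If code generation: translate the condition, emit a
# conditional jump past the substatement, translate the substatement,
# then place the 'after' label. Stand-in classes are illustrative.

code = []
label_count = 0

def new_label():
    global label_count
    label_count += 1
    return f"L{label_count}"

def emit(instr):
    code.append(instr)

class Expr:
    def __init__(self, addr):
        self.addr = addr
    def rvalue(self):
        return self.addr              # trivial case: already a name

class Stmt:
    def __init__(self, text):
        self.text = text
    def gen(self):
        emit(self.text)               # stand-in for a real substatement

class If:
    def __init__(self, x, y):
        self.E, self.S = x, y         # save the two parameters
        self.after = new_label()      # unique label for the statement's end
    def gen(self):
        t = self.E.rvalue()                      # translate the condition
        emit(f"ifFalse {t} goto {self.after}")   # jump past S when false
        self.S.gen()                             # translate the substatement
        emit(f"{self.after}:")

If(Expr("x"), Stmt("y = z")).gen()
print("\n".join(code))
# ifFalse x goto L1
# y = z
# L1:
```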

63 Functions lvalue and rvalue
a = a + 1: a is computed differently for the l-value and the r-value. Two functions are used to distinguish them. lvalue generates instructions to compute the subtrees below x, and returns a node representing the "address" for x. rvalue generates the instructions to compute x into a temporary, and returns a new node representing that temporary. R-values are what we usually think of as "values", while l-values are "locations". rvalue may produce a new node: it generates instructions and returns a possibly new node. lvalue does not produce a new node for identifiers: when applied to a node x, function lvalue simply returns x if it is the node for an identifier (i.e., if x is of class Id). In our simple language, the only other case where an expression has an l-value occurs when x represents an array access, such as a[i], e.g., x = a[i]. In this case, x has the form Access(y, z), where class Access is a subclass of Expr, y represents the name of the accessed array, and z represents the offset (index) of the chosen element in that array. Function lvalue calls rvalue(z) to generate instructions, if needed, to compute the r-value of z; since the r-value of z must be computed, a new node may be produced when z is an expression (see the semantics of rvalue).

64 Translation of Expressions
Approach: no code is generated for identifiers and constants. If a node x of class Expr has operator op, then an instruction is emitted to compute the value at node x into a temporary. Expression i-j+k translates into t1 = i-j t2 = t1+k Expression 2 * a[i] translates into t1 = a [ i ] t2 = 2 * t1 Note: do not use a temporary in place of a[i] if a[i] appears on the left side of an assignment.
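The approach can be sketched with minimal node classes. This is an illustration under stated assumptions: the Id and Op classes are invented stand-ins for the slides' syntax-tree nodes, and array access (the Access/lvalue case from the previous slide) is omitted to keep the sketch short.

```python
# Sketch of expression translation: identifiers generate no code; each
# operator node computes its value into a fresh temporary. Node classes
# are minimal stand-ins, not the course's actual classes.

code = []
temp_count = 0

def new_temp():
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

class Id:
    def __init__(self, name):
        self.name = name
    def rvalue(self):
        return self.name              # no code for identifiers

class Op:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def rvalue(self):
        # translate operands first, so temporaries number left to right
        l = self.left.rvalue()
        r = self.right.rvalue()
        t = new_temp()                # result goes into a temporary
        code.append(f"{t} = {l} {self.op} {r}")
        return t

# i-j+k, grouped as (i-j)+k
Op("+", Op("-", Id("i"), Id("j")), Id("k")).rvalue()
print("\n".join(code))
# t1 = i - j
# t2 = t1 + k
```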

65 Translation of Expressions
Example:

66 Test Yourself Generate three-address codes for
Test Yourself Generate three-address code for if (x[2*a]==y[b]) x[2*a+1]=y[b+1];
t4=2*a
t2=x[t4]
t3=y[b]
t1= t2 == t3
ifFalse t1 goto after
t5=t4+1
t7=b+1
t6=y[t7]
x[t5]=t6
after:
Construct the syntax tree, then write the intermediate code: If.gen() → E.rvalue() → the == operation … Note the order in which the temporary variables are generated. Numbered, with the jump target given as an instruction position:
1: t4=2*a
2: t2=x[t4]
3: t3=y[b]
4: t1= t2 == t3
5: ifFalse t1 goto 10
6: t5=t4+1
7: t7=b+1
8: t6=y[t7]
9: x[t5]=t6
10:

67 Code Optimization The code optimizer optimizes the code produced by the intermediate code generator in terms of time and space.

68 Code Generation The code generator takes as input an intermediate representation of the source program and maps it into the target language. Example: MOVE id3, R1 MULT #60.0, R1 ADD id2, R1 MOVE R1, id1 It produces the target language for a specific architecture; the target program is normally a relocatable object file containing machine code.

69 Issues Driving Compiler Design
Correctness Speed (runtime and compile time) Degrees of optimization Multiple passes Space Feedback to user Debugging

70 Tools Lexical Analysis – Lex, Flex, JLex
Syntax Analysis – Yacc, JavaCC, SableCC Semantic Analysis – Yacc, JavaCC, SableCC

71 Homework Reading: Chapters 1 and 2

