Chapter 4: Lexical and Syntax Analysis

Compilation
Translating from a high-level language to machine code is organized into several phases, or passes. In the early days, passes communicated through intermediate files, but this is no longer necessary.

Language Specification
We must first describe the language in question by giving its specification.
Syntax: defines the symbols (vocabulary) and the valid programs (sentences).
Semantics: gives meaning to sentences.
The formal specifications are often the input to tools that build translators automatically.

Compiler passes
A typical sequence of passes, with the representation that flows between each pair:
string of characters → Lexical Analyzer → string of tokens → Parser → abstract syntax tree → Semantic Analyzer → abstract syntax tree → Source-to-source optimizer → abstract syntax tree → Translator → medium-level intermediate code → Optimizer → medium-level intermediate code → Translator → low-level intermediate code → Optimizer → low-level intermediate code → Final Assembly → executable/object code

Compiler passes
Source program → front end (lexical scanner, parser, semantic analyzer) → translator → optimizer → back end (final assembly) → target program. The symbol table manager and the error handler interact with all phases.

Lexical analyzer
Also called a scanner or tokenizer, the lexical analyzer converts a stream of characters into a stream of tokens. Tokens include: keywords such as for, while, and class; special characters such as +, -, (, and <; variable name occurrences; and constant occurrences such as 1, 0, and true.

Comparison with Lexical Analysis

Phase    Input                    Output
Lexer    sequence of characters   sequence of tokens
Parser   sequence of tokens       parse tree

Lexical analyzer The lexical analyzer is usually a subroutine of the parser. Each token is a single entity. A numerical code is usually assigned to each type of token.
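
As a minimal illustration, such numeric codes can be represented with a small enumeration; the names and numbers below are hypothetical, chosen only to match the token names used later in these notes:

    from enum import IntEnum

    class TokenCode(IntEnum):
        IDENT = 1
        INT_LIT = 2
        ASSIGN_OP = 3
        SUBTRACT_OP = 4
        DIVISION_OP = 5
        SEMICOLON = 6

    # A token pairs its numeric code with the matched lexeme:
    token = (TokenCode.IDENT, "oldsum")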

Lexical analyzer
Lexical analyzers perform line reconstruction (deleting comments, deleting white space, performing text substitution) and lexical translation (the translation of lexemes into tokens). Often additional information is affiliated with a token.

Parser
The parser performs syntax analysis: it imposes syntactic structure on a sentence. Parse trees are used to expose this structure, though these trees are often not built explicitly; simpler representations of them are often used. A parser accepts a string of tokens and builds a parse tree representing the program.

Parser
The collection of all the programs in a given language is usually specified using a list of rules known as a context-free grammar.

Parser
A grammar has four components:
1. A set of tokens, known as terminal symbols
2. A set of variables, or non-terminals
3. A set of productions, where each production consists of a non-terminal, an arrow, and a sequence of tokens and/or non-terminals
4. A designation of one of the non-terminals as the start symbol
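
For example, in the arrow notation used for token definitions later in this chapter, a standard textbook expression grammar with terminals id, +, *, (, and ), non-terminals expr, term, and factor, and start symbol expr looks like this:

    expr → expr + term | term
    term → term * factor | factor
    factor → ( expr ) | id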

Symbol Table Management
The symbol table is a data structure used by all phases of the compiler to keep track of user-defined symbols and keywords. During the early phases (lexical and syntax analysis), symbols are discovered and put into the symbol table; during later phases, symbols are looked up to validate their usage.

Symbol Table Management
Typical symbol table activities: add a new name; add information for a name; access the information for a name; determine whether a name is present in the table; remove a name; revert to a previous usage for a name (close a scope).

Symbol Table Management
Many implementations are possible: a linear list, a sorted list, a hash table, or a tree structure.
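
As a rough sketch of the hash-table approach with nested scopes, here is a Python illustration; the method names and the contents of the info records are assumptions for the example, not a prescribed interface:

    class SymbolTable:
        """Hash-table symbol table with a stack of scopes (illustrative sketch)."""

        def __init__(self):
            self.scopes = [{}]                  # innermost scope is last

        def open_scope(self):
            self.scopes.append({})

        def close_scope(self):
            self.scopes.pop()                   # revert names to their previous usage

        def add(self, name, info):
            self.scopes[-1][name] = info        # add a name and its information

        def lookup(self, name):
            for scope in reversed(self.scopes): # innermost declaration wins
                if name in scope:
                    return scope[name]
            return None                         # name is not present in the table

    table = SymbolTable()
    table.add("x", {"kind": "varid", "type": "int"})
    table.open_scope()
    table.add("x", {"kind": "varid", "type": "float"})   # shadows the outer x
    assert table.lookup("x")["type"] == "float"
    table.close_scope()
    assert table.lookup("x")["type"] == "int"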

Symbol Table Management
Typical information fields: print value; kind (e.g., reserved, typeid, varid, funcid); block number/level number; type; initial value; base address; and so on.

Abstract Syntax Tree
The parse tree is used to recognize the components of the program and to check that the syntax is correct. As the parser applies productions, it usually generates the components of a simpler tree, known as the abstract syntax tree (AST). The meaning of a component is derived from the way the statement is organized in a subtree.
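
As an illustrative sketch (the node classes are assumptions, not from any particular compiler), the assignment c = 1 * d used in the worked example later in this chapter could be represented by an AST built from a few Python classes:

    from dataclasses import dataclass

    @dataclass
    class Ident:
        name: str

    @dataclass
    class Const:
        value: int

    @dataclass
    class BinOp:
        op: str
        left: object
        right: object

    @dataclass
    class Assign:
        lhs: Ident
        rhs: object

    # AST for the statement:  c = 1 * d
    tree = Assign(Ident("c"), BinOp("*", Const(1), Ident("d")))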

Semantic Analyzer The semantic analyzer completes the symbol table with information on the characteristics of each identifier. The symbol table is usually initialized during parsing. One entry is created for each identifier and constant. Scope is taken into account. Two different variables with the same name will have different entries in the symbol table. The semantic analyzer completes the table using information from declarations.

Semantic Analyzer
The semantic analyzer does type checking, flow-of-control checks, and uniqueness checks (identifiers, case labels, etc.). One objective is to identify semantic errors statically, for example: undeclared identifiers; unreachable statements; identifiers used in the wrong context; methods called with the wrong number of parameters or with parameters of the wrong type.

Semantic Analyzer
Some semantic errors have to be detected at run time, because the necessary information is not available at compile time: an array subscript is out of bounds; a variable is not initialized; division by zero.

Error Management
Errors can occur in all phases of the compiler: invalid input characters, syntax errors, semantic errors, etc. Good compilers will attempt to recover from errors and continue.

Translator
The lexical scanner, parser, and semantic analyzer are collectively known as the front end of the compiler. The second part, or back end, starts by generating low-level code from the (possibly optimized) AST.

Translator
Rather than generating code for a specific architecture, most compilers generate an intermediate language. Three-address code is popular: it is really a flattened tree representation; it is simple; it is flexible (it captures the essence of many target architectures); and it can be interpreted.
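
A minimal Python sketch of flattening an expression tree into three-address code; the tuple-based tree encoding and the temporary-naming scheme are assumptions made for this illustration:

    from itertools import count

    _temps = count(1)

    def gen_tac(node, code):
        """Emit three-address code for an expression tree; return the result's name."""
        if isinstance(node, (str, int)):        # leaf: variable name or constant
            return str(node)
        op, left, right = node                  # interior node: (op, left, right)
        l = gen_tac(left, code)
        r = gen_tac(right, code)
        t = f"t{next(_temps)}"                  # fresh temporary for this node
        code.append(f"{t} := {l} {op} {r}")
        return t

    code = []
    gen_tac(("+", ("*", "a", "b"), "c"), code)  # the expression (a * b) + c
    # code is now ["t1 := a * b", "t2 := t1 + c"]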

Translator One way of performing intermediate code generation: Attach meaning to each node of the AST. The meaning of the sentence = the “meaning” attached to the root of the tree.

XIL
An example of a medium-level intermediate language is XIL. XIL is used by IBM to compile FORTRAN, C, C++, and Pascal for the RS/6000. Compilers for Fortran 90 and C++ have been developed using XIL for other machines, such as the Intel 386, SPARC, and S/370.

Optimizers
Intermediate code is examined and improved. Optimization can be simple, such as changing "a := a + 1" to "increment a" or changing "3*5" to "15", or complicated, such as reorganizing data and data accesses for cache efficiency. Optimization can improve running time by orders of magnitude, often also decreasing program size.
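
A sketch of the simplest kind of improvement, constant folding, over the same tuple-based expression trees as the translator sketch above (an illustration only, not a production optimizer):

    def fold(node):
        """Recursively replace constant subexpressions with their computed values."""
        if isinstance(node, (str, int)):        # leaves are already folded
            return node
        op, left, right = node
        left, right = fold(left), fold(right)
        if isinstance(left, int) and isinstance(right, int):
            return {"+": left + right,
                    "-": left - right,
                    "*": left * right}[op]      # evaluate at compile time
        return (op, left, right)

    assert fold(("*", 3, 5)) == 15                        # "3*5" becomes "15"
    assert fold(("+", "a", ("*", 2, 4))) == ("+", "a", 8)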

Code Generation
Code generation produces "real executable code" for a particular target machine. It is completed by the final assembly phase. The final output can be either assembly language for the target machine or object code ready for linking. The "target machine" can be a virtual machine (such as the Java Virtual Machine, JVM), in which case the "real executable code" is "virtual code" (such as Java bytecode).

Compiler Overview
A worked example for the source fragment: IF (a<b) THEN c=1*d;
Lexical analyzer → token sequence:
    IF ( ID "a" < ID "b" ) THEN ID "c" = CONST "1" * ID "d" ;
Syntax analyzer → syntax tree:
    an IF_stmt whose cond_expr is a < b, with an assign_stmt whose lhs is c and whose rhs is 1 * d
Semantic analyzer → 3-address code:
    GE a, b, L1
    MULT 1, d, c
    L1:
Code optimizer → optimized 3-address code:
    GE a, b, L1
    MOV d, c
    L1:
Code generation → assembly code:
    loadi R1,a
    cmpi R1,b
    jge L1
    loadi R1,d
    storei R1,c
    L1:

Lexical Analysis

What is Lexical Analysis?
The lexical analyzer deals with small-scale language constructs, such as names and numeric literals. The syntax analyzer deals with large-scale constructs, such as expressions, statements, and program units. The syntax analysis portion of a compiler consists of two parts:
1. A low-level part called a lexical analyzer (essentially a pattern matcher).
2. A high-level part called a syntax analyzer, or parser.
The lexical analyzer collects characters into logical groupings and assigns internal codes to the groupings according to their structure.

Lexical Analyzer in Perspective
The parser repeatedly asks the lexical analyzer to "get next token" from the source program; the lexical analyzer returns one token at a time, and both phases consult the symbol table.

Lexical Analyzer in Perspective
LEXICAL ANALYZER: scan input; remove white space; identify tokens; create symbol table; insert tokens into the AST; generate errors; send tokens to the parser.
PARSER: perform syntax analysis; actions dictated by token order; update symbol table entries; create an abstract representation of the source; generate errors.

Lexical analyzers extract lexemes from a given input string and produce the corresponding tokens. For example, for the statement sum = oldsum - value / 100;

Token          Lexeme
IDENT          sum
ASSIGN_OP      =
IDENT          oldsum
SUBTRACT_OP    -
IDENT          value
DIVISION_OP    /
INT_LIT        100
SEMICOLON      ;

Basic Terminology
What are the major terms for lexical analysis?
TOKEN: a classification for a common set of strings. Examples include <identifier>, <number>, etc.
PATTERN: the rules that characterize the set of strings for a token.
LEXEME: the actual sequence of characters that matches a pattern and is classified by a token. Example identifiers: x, count, name, etc.

Basic Terminology

Token      Sample Lexemes            Informal Description of Pattern
const      const                     const
if         if                        if
relation   <, <=, =, <>, >, >=       < or <= or = or <> or >= or >
id         pi, count, D2             letter followed by letters and digits
num        3.1416, 0, 6.02E23        any numeric constant
literal    "core dumped"             any characters between " and " except "

The token classifies the pattern. The actual values are critical; this information is (1) stored in the symbol table and (2) returned to the parser.

Token Definitions
Suppose S is the string banana.
Prefixes: ban, banana. Suffixes: ana, banana. Substrings: nan, ban, ana, banana. Subsequences: bnan, nn.

Token Definitions
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Shorthand notation:
"+" : one or more; r* = r+ | ε and r+ = r r*
"?" : zero or one; r? = r | ε
[range] : a set range of characters (replaces "|"); [A-Z] = A | B | C | … | Z
Using the shorthand: id → [A-Za-z][A-Za-z0-9]*
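
These definitions translate directly into the syntax of common regular-expression libraries. A small Python sketch of the id pattern above and of the num pattern used on the next slide:

    import re

    ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")
    # num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
    NUM_RE = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?")

    assert ID_RE.fullmatch("count")
    assert NUM_RE.fullmatch("6.02E23")
    assert not ID_RE.fullmatch("2x")       # an id must start with a letter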

Token Recognition
Assume the following tokens: if, then, else, relop, id, num. What language constructs are they used for? Given the tokens, what are the patterns?
Grammar:
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
Patterns:
if → if
then → then
else → else
relop → < | <= | > | >= | = | <>
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

What Else Does a Lexical Analyzer Do?
It scans away blanks, newlines, and tabs. Can we define tokens for these?
blank → a blank character
tab → ^T
newline → ^M
delim → blank | tab | newline
ws → delim+

Symbol Tables

Regular Expression    Token    Attribute-Value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE

Note: each token has a unique token identifier to define the category of its lexemes.

Building a Lexical Analyzer
There are three approaches to building a lexical analyzer:
1. Write a formal description of the token patterns of the language using a descriptive language; a tool (such as lex on UNIX systems) then builds the analyzer automatically.
2. Design a state transition diagram that describes the token patterns of the language, and write a program that implements the diagram.
3. Design a state transition diagram and hand-construct a table-driven implementation of the state diagram.

Diagrams for Tokens
Transition diagrams (TDs) are used to represent the tokens. Each transition diagram has: states, represented by circles; actions, represented by arrows between states; a start state, marking the beginning of a pattern (arrowhead); and final states, marking the end of a pattern (concentric circles). A diagram is deterministic when there is never a choice between two different actions.

Example: Transition Diagrams
[Figure: transition diagrams for numeric constants, with numbered states connected by edges labeled digit, ".", "E", "+ | -", and "other"; accepting states are starred.]

State diagram to recognize names, reserved words, and integer literals
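
A hand-coded Python sketch of such a state machine (the second approach from the previous slide); the reserved-word set and the token names are assumptions for the example:

    RESERVED = {"if", "then", "else", "for", "while"}    # assumed reserved words

    def scan(source):
        """Hand-coded state machine: returns a list of (token, lexeme) pairs."""
        tokens, i = [], 0
        while i < len(source):
            ch = source[i]
            if ch.isspace():                             # skip white space
                i += 1
            elif ch.isalpha():                           # the "name" states
                start = i
                while i < len(source) and source[i].isalnum():
                    i += 1
                lexeme = source[start:i]
                kind = "RESERVED" if lexeme in RESERVED else "IDENT"
                tokens.append((kind, lexeme))
            elif ch.isdigit():                           # the "integer literal" states
                start = i
                while i < len(source) and source[i].isdigit():
                    i += 1
                tokens.append(("INT_LIT", source[start:i]))
            else:
                raise ValueError(f"unexpected character: {ch!r}")
        return tokens

    assert scan("if x1 then 42") == [("RESERVED", "if"), ("IDENT", "x1"),
                                     ("RESERVED", "then"), ("INT_LIT", "42")]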

Reasons to Use BNF to Describe Syntax
BNF provides a clear syntax description; the parser can be based directly on the BNF; and parsers based on BNF are easy to maintain.

Reasons to Separate Lexical and Syntax Analysis
Simplicity: less complex approaches can be used for lexical analysis, and separating the two simplifies the parser. Efficiency: separation allows optimization of the lexical analyzer. Portability: parts of the lexical analyzer may not be portable, but the parser is always portable.

Summary of Lexical Analysis
A lexical analyzer is a pattern matcher for character strings, acting as a "front end" for the parser. It identifies substrings of the source program that belong together (lexemes). Lexemes match a character pattern, which is associated with a lexical category called a token. For example, sum is a lexeme; its token may be IDENT.

Semantic Analysis Intro to Type Checking

The Compiler So Far
Lexical analysis: detects inputs with illegal tokens. Parsing: detects inputs with ill-formed parse trees. Semantic analysis: the last "front end" phase; catches more errors.

What's Wrong?
Example 1: int in x;
Example 2: int i = 12.34;
The first is rejected by the parser as a syntax error; the second parses, but the semantic analyzer rejects it as a type error.

Why a Separate Semantic Analysis?
Parsing cannot catch some errors, because some language constructs are not context-free. Example: all used variables must have been declared (i.e., scoping). Example: a method must be invoked with arguments of the proper type (i.e., typing).

What Does Semantic Analysis Do?
It performs checks of many kinds: all identifiers are declared; types are used correctly; inheritance relationships are valid; classes are defined only once; methods in a class are defined only once; reserved identifiers are not misused; and others. The requirements depend on the language.

Scope
Matching identifier declarations with uses is an important semantic analysis step in most languages.

Scope (Cont.) The scope of an identifier is the portion of a program in which that identifier is accessible The same identifier may refer to different things in different parts of the program Different scopes for same name don’t overlap An identifier may have restricted scope

Static vs. Dynamic Scope
Most languages have static scope: scope depends only on the program text, not on run-time behavior. C has static scope. A few languages are dynamically scoped, where scope depends on the execution of the program: Lisp and COBOL, for example, though current Lisp has changed to mostly static scoping.

Class Definitions
Class names can be used before being defined. We can't check this property using a symbol table, or even in one pass. The solution: pass 1 gathers all class names; pass 2 does the checking. Semantic analysis requires multiple passes, probably more than two.
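
A minimal Python sketch of the two-pass idea; representing each class as a (name, referenced-class-names) pair is an assumption made for the illustration:

    def check_classes(classes):
        """classes: list of (class_name, [names of classes it refers to])."""
        # Pass 1: gather all class names, rejecting duplicate definitions.
        defined = set()
        for name, _ in classes:
            if name in defined:
                raise TypeError(f"class {name} defined more than once")
            defined.add(name)
        # Pass 2: with all names known, forward references can be checked.
        for name, refs in classes:
            for ref in refs:
                if ref not in defined:
                    raise TypeError(f"class {name} refers to undefined class {ref}")

    # Legal even though B is used before it is defined:
    check_classes([("A", ["B"]), ("B", [])])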

Types
What is a type? The notion varies from language to language. The consensus: a type is a set of values together with a set of operations on those values. Classes are one instantiation of the modern notion of type.

Why Do We Need Type Systems? Consider the assembly language fragment addi $r1, $r2, $r3 What are the types of $r1, $r2, $r3?

Types and Operations Certain operations are legal for values of each type It doesn’t make sense to add a function pointer and an integer in C It does make sense to add two integers But both have the same assembly language implementation!

Type Systems A language’s type system specifies which operations are valid for which types The goal of type checking is to ensure that operations are used with the correct types Enforces intended interpretation of values, because nothing else will! Type systems provide a concise formalization of the semantic checking rules

What Can Types Do For Us?
Types can detect certain kinds of errors, such as memory errors (reading from an invalid pointer, etc.) and violations of abstraction boundaries:

    class FileSystem {
        open(x : String) : File { … }
    }
    class Client {
        f(fs : FileSystem) {
            File fdesc <- fs.open("foo")
            …                          -- f cannot see inside fdesc!
        }
    }

Type Checking Overview
Three kinds of languages: statically typed, where all or almost all checking of types is done as part of compilation (C, Java); dynamically typed, where almost all checking of types is done as part of program execution (Scheme); and untyped, with no type checking (machine code).

The Type Wars
There are competing views on static vs. dynamic typing. Static typing proponents say: static checking catches many programming errors at compile time and avoids the overhead of runtime type checks. Dynamic typing proponents say: static type systems are restrictive, and rapid prototyping is easier in a dynamic type system.

The Type Wars (Cont.) In practice, most code is written in statically typed languages with an “escape” mechanism Unsafe casts in C, Java It’s debatable whether this compromise represents the best or worst of both worlds

Type Checking and Type Inference Type Checking is the process of verifying fully typed programs Type Inference is the process of filling in missing type information The two are different, but are often used interchangeably

Rules of Inference We have seen two examples of formal notation specifying parts of a compiler Regular expressions (for the lexer) Context-free grammars (for the parser) The appropriate formalism for type checking is logical rules of inference

Why Rules of Inference?
Inference rules have the form: if Hypothesis is true, then Conclusion is true. Type checking computes via reasoning: if E1 and E2 have certain types, then E3 has a certain type. Rules of inference are a compact notation for "If-Then" statements.

From English to an Inference Rule
The notation is easy to read (with practice). Start with a simplified system and gradually add features. Building blocks: the symbol ∧ is "and"; the symbol ⇒ is "if-then"; x : T means "x has type T".

From English to an Inference Rule (2)
If e1 has type Int and e2 has type Int, then e1 + e2 has type Int.
(e1 has type Int ∧ e2 has type Int) ⇒ e1 + e2 has type Int
(e1 : Int ∧ e2 : Int) ⇒ e1 + e2 : Int

From English to an Inference Rule (3)
The statement (e1 : Int ∧ e2 : Int) ⇒ e1 + e2 : Int is a special case of (Hypothesis1 ∧ … ∧ Hypothesisn) ⇒ Conclusion. This is an inference rule.

Notation for Inference Rules
By tradition, inference rules are written with the hypotheses above a horizontal line and the conclusion below it:

    ⊢ Hypothesis1  …  ⊢ Hypothesisn
    --------------------------------
    ⊢ Conclusion

Type rules have hypotheses and conclusions of the form ⊢ e : T, where ⊢ means "it is provable that …".

Two Rules

    i is an integer
    ---------------
    ⊢ i : Int

    ⊢ e1 : Int    ⊢ e2 : Int
    ------------------------  [Add]
    ⊢ e1 + e2 : Int

Two Rules (Cont.) These rules give templates describing how to type integers and + expressions By filling in the templates, we can produce complete typings for expressions

Example: 1 + 2

    1 is an integer     2 is an integer
    ---------------     ---------------
    ⊢ 1 : Int           ⊢ 2 : Int
    -----------------------------------  [Add]
    ⊢ 1 + 2 : Int

Soundness
A type system is sound if, whenever ⊢ e : T, then e evaluates to a value of type T. We only want sound rules, but some sound rules are better than others:

    i is an integer
    ---------------
    ⊢ i : Object

Type Checking Proofs
Type checking proves facts of the form e : T. The proof is on the structure of the AST, and has the shape of the AST. One type rule is used for each kind of AST node. In the type rule used for a node e, the hypotheses are the proofs of the types of e's sub-expressions, and the conclusion is the proof of the type of e. Types are computed in a bottom-up pass over the AST.

Rules for Constants

    ---------------
    ⊢ false : Bool

    s is a string constant
    ----------------------
    ⊢ s : String

Two More Rules

    ⊢ e : Bool
    --------------  [Not]
    ⊢ not e : Bool

    ⊢ e1 : Bool    ⊢ e2 : T
    ---------------------------------  [Loop]
    ⊢ while e1 loop e2 pool : Object

A Problem
What is the type of a variable reference?

    x is an identifier
    ------------------  [Var]
    ⊢ x : ?

The local, structural rule does not carry enough information to give x a type.

Notes
The type environment gives types to the free identifiers in the current scope. The environment is passed down the AST from the root towards the leaves, while types are computed up the AST from the leaves towards the root.
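
A small Python sketch tying these notes together: the environment is passed down while types are computed bottom-up. The AST encoding and the simplified rule set (integer constants, +, not, and variables) are assumptions made for the illustration:

    def type_of(e, env):
        """Bottom-up type computation; env maps identifiers to their types."""
        if isinstance(e, bool):          # checked before int: True/False are ints in Python
            return "Bool"
        if isinstance(e, int):           # rule: i is an integer, so i : Int
            return "Int"
        if isinstance(e, str):           # rule [Var]: consult the environment
            return env[e]                # a KeyError here means an undeclared identifier
        op, *args = e                    # interior nodes are tuples: (op, subexpr, ...)
        if op == "+":                    # rule [Add]
            if all(type_of(a, env) == "Int" for a in args):
                return "Int"
            raise TypeError("+ requires Int operands")
        if op == "not":                  # rule [Not]
            if type_of(args[0], env) == "Bool":
                return "Bool"
            raise TypeError("not requires a Bool operand")
        raise TypeError(f"no rule for: {e}")

    assert type_of(("+", 1, 2), {}) == "Int"              # the 1 + 2 derivation above
    assert type_of(("not", "x"), {"x": "Bool"}) == "Bool"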

Expressiveness of Static Type Systems A static type system enables a compiler to detect many common programming errors The cost is that some correct programs are disallowed Some argue for dynamic type checking instead Others argue for more expressive static type checking But more expressive type systems are also more complex

Dynamic and Static Types
The dynamic type of an object is the class C used in the "new C" expression that creates the object; it is a run-time notion, and even languages that are not statically typed have the notion of dynamic type. The static type of an expression is a notation that captures all possible dynamic types the expression could take; it is a compile-time notion.

Dynamic and Static Types (Cont.)
The typing rules use very concise notation and are very carefully constructed. Virtually any change in a rule either makes the type system unsound (bad programs are accepted as well typed) or makes the type system less usable (perfectly good programs are rejected). Some good programs will be rejected anyway, since the notion of a "good program" is undecidable.

Type Systems
Type rules are defined on the structure of expressions, and the types of variables are modeled by an environment. Types are a play between flexibility and safety.

End of Lecture 6