
1 CMPSC 160 Translation of Programming Languages, Fall 2002. Instructor: Hugh McGuire. Lecture 2: Phases of a Compiler; Lexical Analysis

2 Announcements
Discussion session Monday
–TA presented JLex: A Lexical Analyzer Generator for Java
Read the JLex user manual (available at the JLex website; the link is on the class webpage)
Read the following chapters from the textbook
–Chapter 1: Introduction
–Chapter 2: A translator from infix expressions to postfix expressions
–Chapter 3: Lexical analysis
Homework 1 is due next Tuesday
–Drop it in the homework box before the lecture or give it to me at the beginning of the lecture

3 High-level View of a Compiler
Must recognize legal (and illegal) programs
Must generate correct code
Must manage storage of all variables (and code)
Must agree with OS and linker on format for object code
[Diagram: Source code → Compiler → Machine code, with Errors as a side output]

4 A Higher-Level View: How Does the Compiler Fit In?
skeletal source program → Preprocessor → source program → Compiler → target assembly program → Assembler → relocatable machine code → Loader/Linker → absolute machine code
–Preprocessor: collects the source program that is divided into separate files; performs macro expansion
–Assembler: generates machine code from the assembly code
–Loader/Linker: links the library routines and other object modules (relocatable object files); generates absolute addresses

5 Traditional Two-pass Compiler
Use an intermediate representation (IR)
The front end maps legal source code into IR
The back end maps IR into target machine code
Admits multiple front ends and multiple passes
–Typically, the front end runs in O(n) or O(n log n) time, while the back end faces NP-Complete problems
Different phases of the compiler also interact through the symbol table
[Diagram: Source code → Front End → IR → Back End → Machine code; both ends report Errors and share the Symbol Table]

6 The Front End
Responsibilities:
Recognize legal programs
Report errors for illegal programs in a useful way
Produce IR and construct the symbol table
Much of front-end construction can be automated
[Diagram: Source code → Scanner → tokens → Parser → IR → Type Checker → IR; Errors reported along the way]

7 The Front End: Scanner
Maps the character stream into words, the basic unit of syntax
Produces tokens and stores lexemes when necessary
–x = x + y ; becomes <ID,x> EQ <ID,x> PLUS <ID,y> SEMICOLON
–Typical tokens include number, identifier, +, -, while, if
The scanner eliminates white space and comments
[Diagram: Source code → Scanner → tokens → Parser → IR → Type Checker → IR; Errors reported along the way]

8 The Front End: Parser
Uses the scanner as a subroutine
Recognizes context-free syntax and reports errors
Guides context-sensitive analysis (type checking)
Builds IR for the source program
Scanning and parsing can be grouped into one pass
[Diagram: Source code → Scanner ⇄ Parser (the parser calls "get next token", the scanner returns a token) → IR → Type Checker → IR; Errors reported along the way]

9 The Front End: Context-Sensitive Analysis
Check that all variables are declared before they are used
Type checking
–Check for type errors such as adding a procedure and an array
Add the necessary type conversions
–int-to-float, float-to-double, etc.
[Diagram: Source code → Scanner → tokens → Parser → IR → Type Checker → IR; Errors reported along the way]

10 The Back End
Responsibilities:
Translate IR into target machine code
Choose instructions to implement each IR operation
Decide which values to keep in registers
Schedule the instructions for the instruction pipeline
Automation has been much less successful in the back end
[Diagram: IR → Instruction Selection → Instruction Scheduling → Register Allocation → Machine code; Errors reported along the way]

11 The Back End: Instruction Selection
Produce fast, compact code
Take advantage of target-language features such as addressing modes
Usually viewed as a pattern-matching problem
–ad hoc methods, pattern matching, dynamic programming
This was "the problem of the future" in the late 1970s, when instruction sets were complex
–RISC architectures simplified this problem
[Diagram: IR → Instruction Selection → Instruction Scheduling → Register Allocation → Machine code; Errors reported along the way]

12 The Back End: Instruction Scheduling
Avoid hardware stalls (keep the pipeline moving)
Use all functional units productively
Optimal scheduling is NP-Complete
[Diagram: IR → Instruction Selection → Instruction Scheduling → Register Allocation → Machine code; Errors reported along the way]

13 The Back End: Register Allocation
Have each value in a register when it is used
Manage a limited set of registers
Can change instruction choices and insert LOADs and STOREs
Optimal allocation is NP-Complete
Compilers approximate solutions to NP-Complete problems
[Diagram: IR → Instruction Selection → Instruction Scheduling → Register Allocation → Machine code; Errors reported along the way]

14 Traditional Three-pass Compiler
Code optimization analyzes and transforms the IR
The primary goal is to reduce the running time of the compiled code
–May also improve space or power consumption (mobile computing)
Must preserve the "meaning" of the code
[Diagram: Source code → Front End → IR → Middle End → IR → Back End → Machine code; Errors reported along the way]

15 The Optimizer (or Middle End)
Modern optimizers are structured as a series of passes
[Diagram: IR → Opt 1 → Opt 2 → Opt 3 → … → Opt n → IR; Errors reported along the way]
Typical transformations:
Discover and propagate constant values
Move a computation to a less frequently executed place
Discover a redundant computation and remove it
Remove unreachable code

16 First Phase: Lexical Analysis (Scanning)
The scanner maps the stream of characters into words, the basic unit of syntax
The characters that form a word are its lexeme
Its syntactic category is called its token
The scanner discards white space and comments
[Diagram: Source code → Scanner ⇄ Parser (the parser calls "get next token", the scanner returns a token) → IR; Errors reported along the way]

17 Why Lexical Analysis?
By separating context-free syntax from lexical analysis:
–We can develop efficient scanners
–We can automate efficient scanner construction
–We can write simple specifications for tokens
[Diagram: specifications (regular expressions) → Scanner Generator → tables or code; source code → Scanner → tokens]

18 What are Tokens?
Token: the basic unit of syntax
–Keywords: if, while, ...
–Operators: +, *, <=, ||, ...
–Identifiers (names of variables, arrays, procedures, classes): i, i1, j1, count, sum, ...
–Numbers: 12, 3.14, 7.2E-2, ...
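In a hand-written scanner, these categories typically become an enumerated type. A minimal Java sketch (the category names below are illustrative, not prescribed by the slides):

// Hypothetical token categories for a small language.
enum TokenType {
    IF, WHILE,             // keywords
    PLUS, TIMES, LEQ, OR,  // operators: +  *  <=  ||
    ID,                    // identifiers
    NUM                    // numeric literals
}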

19 What are Tokens?
Tokens are the terminal symbols for the parser
–Tokens are treated as indivisible units in the grammar defining the source language
1. S → expr
2. expr → expr op term
3.       | term
4. term → number
5.       | id
6. op → +
7.     | -
number, id, +, and - are tokens passed from the scanner to the parser. They form the terminal symbols of this simple grammar.

20 Lexical Concepts
Token: the basic unit of syntax; the syntactic output of the scanner
Pattern: the rule that describes the set of strings corresponding to a token; the specification of the token
Lexeme: a sequence of input characters that matches a pattern and generates the token

Token   Lexeme                    Pattern
WHILE   while                     while
IF      if                        if
ID      i1, length, count, sqrt   letter followed by letters and digits

21 Tokens can have Attributes
A problem: if we send this output to the parser, is it enough?
if (i == j) z = 0; else z = 1; becomes
IF, LPAREN, ID, EQEQ, ID, RPAREN, ID, EQ, NUM, SEMICOLON, ELSE, ID, EQ, NUM, SEMICOLON
Where are the variable names, procedure names, etc.? All identifiers look the same.
Tokens can have attributes that they pass to the parser (using the symbol table):
IF, LPAREN, <ID,i>, EQEQ, <ID,j>, RPAREN, <ID,z>, EQ, <NUM,0>, SEMICOLON, ELSE, <ID,z>, EQ, <NUM,1>, SEMICOLON
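One common realization, sketched here rather than prescribed by the course, is a token object that carries its category plus attributes such as the lexeme or a symbol-table index:

// Sketch: a token carrying attribute values (names are our own).
class Token {
    enum Type { IF, LPAREN, RPAREN, ID, EQEQ, EQ, NUM, SEMICOLON, ELSE }

    final Type type;        // syntactic category, e.g. ID or NUM
    final String lexeme;    // matched characters, e.g. "i" or "0"
    final int symtabIndex;  // symbol-table entry for identifiers, or -1

    Token(Type type, String lexeme, int symtabIndex) {
        this.type = type;
        this.lexeme = lexeme;
        this.symtabIndex = symtabIndex;
    }
}

The parser then sees a uniform stream of Token objects and can recover each identifier's name through the symbol table.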

22 How do we specify lexical patterns?
Some patterns are easy
Keywords and operators
–Specified as literal patterns: if, then, else, while, =, +, …

23 Specifying Lexical Patterns
Some patterns are more complex
Identifiers
–letter followed by letters and digits
Numbers
–Integer: 0, or a digit between 1 and 9 followed by digits between 0 and 9
–Decimal: an optional sign ("+" or "-"), followed by either "0" or a nonzero digit and an arbitrary number of digits, followed by a decimal point, followed by an arbitrary number of digits
GOAL: We want concise descriptions of patterns, and we want to construct the scanner automatically from these descriptions

24 Specifying Lexical Patterns: Regular Expressions
Regular expressions (REs) describe regular languages
Regular Expression (over alphabet Σ):
–ε (the empty string) is an RE denoting the set {ε}
–If a is in Σ, then a is an RE denoting {a}
–If x and y are REs denoting languages L(x) and L(y), then
  (x) is an RE denoting L(x)
  x | y is an RE denoting L(x) ∪ L(y)
  xy is an RE denoting L(x)L(y)
  x* is an RE denoting L(x)*
Precedence is closure, then concatenation, then alternation; all are left-associative
–x | y* z is equivalent to x | ((y*) z)

25 Operations on Languages
Operation                                 Definition
Union of L and M, written L ∪ M           L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, written LM      LM = { st | s ∈ L and t ∈ M }
Kleene closure of L, written L*           L* = union of L^i for all i ≥ 0
Positive closure of L, written L+         L+ = union of L^i for all i ≥ 1
Exponentiation of L, written L^i          L^i = {ε} if i = 0, and L^(i-1) L if i > 0

26 Examples of Regular Expressions
All strings of 1s and 0s: (0 | 1)*
All strings of 1s and 0s beginning with a 1: 1 (0 | 1)*
All strings of 0s and 1s containing at least two consecutive 1s: (0 | 1)* 11 (0 | 1)*
All strings of alternating 0s and 1s: (ε | 1) (0 1)* (ε | 0)
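As a quick sanity check, these REs transliterate directly into java.util.regex patterns (an illustration of ours, not part of the lecture; each ε becomes an optional piece):

import java.util.regex.Pattern;

public class ReExamples {
    public static void main(String[] args) {
        // (0|1)* 11 (0|1)* : at least two consecutive 1s
        Pattern twoOnes = Pattern.compile("(0|1)*11(0|1)*");
        System.out.println(twoOnes.matcher("0110").matches()); // true
        System.out.println(twoOnes.matcher("0101").matches()); // false

        // (ε|1)(01)*(ε|0) : alternating 0s and 1s
        Pattern alternating = Pattern.compile("1?(01)*0?");
        System.out.println(alternating.matcher("0101").matches()); // true
        System.out.println(alternating.matcher("0110").matches()); // false
    }
}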

27 Extensions to Regular Expressions (a la JLex)
x+ = x x*, denotes L(x)+
x? = x | ε, denotes L(x) ∪ {ε}
[abc] = a | b | c, matches one of the characters in the square brackets
a-z = a | b | c | ... | z, a range
[0-9a-z] = 0 | 1 | 2 | ... | 9 | a | b | c | ... | z
[^abc], where ^ means negation: matches any character except a, b, or c
. (dot) matches any character except the newline; since \n means newline, dot is equivalent to [^\n]
"[" matches a left square bracket: metacharacters in double quotes become plain characters
\[ matches a left square bracket: a metacharacter after a backslash becomes a plain character

28 Regular Definitions
We can define macros using regular expressions and use them in other regular expressions:
Letter → (a | b | c | … | z | A | B | C | … | Z)
Digit → (0 | 1 | 2 | … | 9)
Identifier → Letter ( Letter | Digit )*
Important: we should be able to order these definitions so that every definition uses only definitions defined before it (i.e., no recursion)
Regular definitions can be converted to basic regular expressions by macro expansion
In JLex, enclose definitions in curly braces when using them:
Identifier → {Letter} ( {Letter} | {Digit} )*
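In a JLex specification file, such macros go in the directives section and are referenced in braces by the rules. A minimal sketch (the printed message is our placeholder; a real scanner would return a token object instead):

%%
Letter = [a-zA-Z]
Digit = [0-9]
%%
{Letter}({Letter}|{Digit})*  { System.out.println("ID: " + yytext()); }

Here yytext() is the JLex-generated method that returns the matched lexeme.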

29 Examples of Regular Expressions
Digit → (0 | 1 | 2 | … | 9)
Integer → (+ | -)? (0 | (1 | 2 | 3 | … | 9) Digit*)
Decimal → Integer “.” Digit*
Real → (Integer | Decimal) E (+ | -)? Digit*
Complex → “(” Real “,” Real “)”
Numbers can get even more complicated.
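Transliterated into java.util.regex character-class syntax (our translation, not from the slides), Integer and Decimal become:

import java.util.regex.Pattern;

public class NumberPatterns {
    // Integer -> (+|-)? (0 | (1..9)(0..9)*)
    static final Pattern INTEGER = Pattern.compile("[+-]?(0|[1-9][0-9]*)");
    // Decimal -> Integer "." Digit*
    static final Pattern DECIMAL = Pattern.compile("[+-]?(0|[1-9][0-9]*)\\.[0-9]*");

    public static void main(String[] args) {
        System.out.println(INTEGER.matcher("-42").matches());  // true
        System.out.println(INTEGER.matcher("007").matches());  // false: no leading zeros
        System.out.println(DECIMAL.matcher("3.14").matches()); // true
    }
}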

30 From Regular Expressions to Scanners
Regular expressions are useful for specifying the patterns that correspond to tokens
However, we also want to construct programs that recognize these patterns
How do we do it?
–Use finite automata!

31 Example
Consider the problem of recognizing register names in an assembler:
Register → R (0 | 1 | 2 | … | 9) (0 | 1 | 2 | … | 9)*
Allows registers of arbitrary number
Requires at least one digit
The RE corresponds to a recognizer (or DFA):
[Diagram "Recognizer for Register": initial state s0 goes to s1 on R; s1 goes to accepting state s2 on a digit (0|1|2|…|9); s2 loops on digits; every other transition (a digit from s0, R from s1 or s2) goes to the error state se, which loops on (R|0|1|2|…|9)]

32 Deterministic Finite Automata (DFA)
A set of states S
–S = { s0, s1, s2, se }
A set of input symbols (an alphabet) Σ
–Σ = { R, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
A transition function δ : S × Σ → S
–Maps (state, symbol) pairs to states
–δ = { (s0, R) → s1, (s0, 0-9) → se, (s1, 0-9) → s2, (s1, R) → se, (s2, 0-9) → s2, (s2, R) → se, (se, R|0-9) → se }
A start state
–s0
A set of final (or accepting) states
–Final = { s2 }
A DFA accepts a word x iff there exists a path in the transition graph from the start state to a final state such that the edge labels along the path spell out x

33 DFA Simulation
Start in state s0 and follow transitions on each input character
The DFA accepts a word x iff x leaves it in a final state (s2)
So, “R17” takes it through s0, s1, s2 and is accepted
“R” takes it through s0, s1 and fails
“A” takes it straight to se
“R17R” takes it through s0, s1, s2, se and is rejected
[Diagram "Recognizer for Register": initial state s0 goes to s1 on R; s1 goes to accepting state s2 on a digit; s2 loops on digits]

34 Simulating a DFA
The recognizer translates directly into code:
state = s0;
char = get_next_char();
while (char != EOF) {
  state = δ(state, char);
  char = get_next_char();
}
if (state ∈ Final) report acceptance;
else report failure;
We can store the transition table in a two-dimensional array:
δ    R    0,1,2,3,4,5,6,7,8,9    other
s0   s1   se                     se
s1   se   s2                     se
s2   se   s2                     se
se   se   se                     se
We can also store the final states in an array: Final = { s2 }
To change DFAs, just change the arrays
Takes O(|x|) time for an input string x
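This table-driven simulation renders directly into runnable Java; in the sketch below the state numbering and the character-to-column mapping are our own choices:

public class DfaSim {
    // States: 0 = s0, 1 = s1, 2 = s2, 3 = se. Columns: 0 = 'R', 1 = digit, 2 = other.
    static final int[][] DELTA = {
        {1, 3, 3},  // s0
        {3, 2, 3},  // s1
        {3, 2, 3},  // s2
        {3, 3, 3}   // se
    };
    static final boolean[] FINAL = {false, false, true, false};

    static int column(char c) {
        if (c == 'R') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    static boolean accepts(String x) {
        int state = 0;                        // start in s0
        for (char c : x.toCharArray())
            state = DELTA[state][column(c)];  // one table lookup per character
        return FINAL[state];                  // accept iff we end in a final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("R17"));   // true
        System.out.println(accepts("R"));     // false
        System.out.println(accepts("R17R"));  // false
    }
}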

35 Recognizing the Longest Accepted Prefix
Given an input string, this simulation algorithm returns the longest accepted prefix:
accepted = false;
current_string = ε; // empty string
state = s0; // initial state
if (state ∈ Final) {
  accepted_string = current_string;
  accepted = true;
}
char = get_next_char();
while (char != EOF) {
  state = δ(state, char);
  current_string = current_string + char;
  if (state ∈ Final) {
    accepted_string = current_string;
    accepted = true;
  }
  char = get_next_char();
}
if (accepted) return accepted_string;
else report error;
Given the input “R17R”, this algorithm returns “R17”
(Same transition table and Final = { s2 } as on the previous slide)
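The same maximal-munch idea in runnable Java, reusing the register DFA's table (our encoding; null stands in for the pseudocode's error report):

public class LongestPrefix {
    // Same DFA as before: rows s0, s1, s2, se; columns 'R', digit, other.
    static final int[][] DELTA = {
        {1, 3, 3}, {3, 2, 3}, {3, 2, 3}, {3, 3, 3}
    };
    static final boolean[] FINAL = {false, false, true, false};

    static int column(char c) {
        if (c == 'R') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    // Returns the longest prefix of x the DFA accepts, or null if none.
    static String longestAcceptedPrefix(String x) {
        int state = 0;
        String accepted = FINAL[state] ? "" : null;  // check the empty prefix first
        for (int i = 0; i < x.length(); i++) {
            state = DELTA[state][column(x.charAt(i))];
            if (FINAL[state]) accepted = x.substring(0, i + 1);
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(longestAcceptedPrefix("R17R")); // R17
        System.out.println(longestAcceptedPrefix("A"));    // null
    }
}

A real scanner restarts the DFA after each accepted lexeme, so the whole input is tokenized by repeated maximal munch.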

