COS 320 Compilers
David Walker

Outline
Last Week:
–Introduction to ML
Today:
–Lexical Analysis
–Reading: Chapter 2 of Appel

The Front End
Lexical Analysis: create a sequence of tokens from a stream of characters
Syntax Analysis: create an abstract syntax tree from the sequence of tokens
Type Checking: check the program for well-formedness constraints
(Pipeline: stream of characters --> Lexer --> stream of tokens --> Parser --> abstract syntax --> Type Checker)

Lexical Analysis Lexical Analysis: Breaks stream of ASCII characters (source) into tokens Token: An atomic unit of program syntax –i.e., a word as opposed to a sentence Tokens and their types: Type: ID REAL SEMI LPAREN NUM IF Characters Recognized: foo, x, listcount 10.45, 3.14, -2.1 ; ( 50, 100 if Token: ID(foo), ID(x),... REAL(10.45), REAL(3.14),... SEMI LPAREN NUM(50), NUM(100) IF

Lexical Analysis Example
Input:  x = ( y + 4.0 ) ;
Output: ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI

Lexer Implementation
Implementation Options:
1. Write a Lexer from scratch
–Boring, error-prone, and too much work
2. Use a Lexer Generator
–Quick and easy. Good for lazy compiler writers.
(Pipeline: Lexer Specification --> lexer generator --> Lexer; the generated Lexer turns a stream of characters into a stream of tokens)

How do we specify the lexer?
–Develop another language: we'll use a language involving regular expressions to specify tokens
What is a lexer generator?
–Another compiler...

Some Definitions
We will want to define the language of legal tokens our lexer can recognize:
–Alphabet: a collection of symbols (ASCII is an alphabet)
–String: a finite sequence of symbols taken from our alphabet
–Language of legal tokens: a set of strings
  –Language of ML keywords: the set of all strings which are ML keywords (FINITE)
  –Language of ML tokens: the set of all strings which map to ML tokens (INFINITE)
A language can also be a more general set of strings:
–e.g., the ML language: the set of all strings representing correct ML programs (INFINITE)

Regular Expressions: Construction
Base Cases:
–For each symbol a in the alphabet, a is an RE denoting the set {a}
–Epsilon (ε) is an RE denoting {ε}, the set containing only the empty string
Inductive Cases (M and N are REs):
–Alternation (M | N) denotes the strings in M or in N: (a | b) == {a, b}
–Concatenation (M N) denotes the strings in M concatenated with the strings in N: (a | b) (a | c) == {aa, ac, ba, bc}
–Kleene closure (M*) denotes the strings formed by any number of repetitions of strings in M: (a | b)* == {ε, a, b, aa, ab, ba, bb, ...}
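The set reading above can be sketched directly in code. The following Python fragment (an illustration only, not part of ml-lex) models each RE form as an operation on sets of strings; since Kleene closure denotes an infinite set, the sketch truncates it at a fixed number of repetitions, which is an assumption of the sketch.

```python
# Denotations of regular expressions as sets of strings (illustrative sketch).

def alt(m, n):
    """Alternation M | N: strings in M or in N."""
    return m | n

def concat(m, n):
    """Concatenation M N: every string of M followed by every string of N."""
    return {a + b for a in m for b in n}

def star(m, reps=2):
    """Kleene closure M*, truncated at `reps` repetitions for illustration.
    The empty string stands for epsilon (zero repetitions)."""
    out, layer = {''}, {''}
    for _ in range(reps):
        layer = concat(layer, m)
        out |= layer
    return out

# The slide's examples:
print(alt({'a'}, {'b'}))                              # {a, b}
print(concat(alt({'a'}, {'b'}), alt({'a'}, {'c'})))   # {aa, ac, ba, bc}
```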

Regular Expressions
Integers begin with an optional minus sign and continue with a sequence of digits.
Regular Expression: (- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*
Writing (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9), and even worse (a | b | c | ...), gets tedious...
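For comparison, the same integer RE can be written with Python's `re` module (a sketch for illustration; `-?` plays the role of (- | ε)). Note that, like the slide's RE, it also accepts the empty string and a bare minus sign; a production token rule would typically require at least one digit.

```python
import re

# (- | epsilon) (0 | 1 | ... | 9)*  from the slide, in re syntax
integer = re.compile(r'-?(0|1|2|3|4|5|6|7|8|9)*')

print(bool(integer.fullmatch('-42')))   # True
print(bool(integer.fullmatch('3.14')))  # False: '.' is not in the alphabet
print(bool(integer.fullmatch('')))      # True: the RE admits zero digits
```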

Regular Expressions (REs)
Common abbreviations:
–[a-c] == (a | b | c)
–.  == any character except \n
–\n == the newline character
–a+ == one or more
–a? == zero or one
All abbreviations can be defined in terms of the "standard" REs.

Ambiguous Token Rule Sets
A single RE is a completely unambiguous specification of a token.
–Call the association of an RE with a token a "rule".
To lex an entire programming language, we need many rules, but ambiguities arise:
–multiple REs or sequences of REs match the same string
–hence many token sequences are possible

Ambiguous Token Rule Sets
Example:
–Identifier tokens: [a-z] [a-z0-9]*
–Sample keyword tokens: if, then, ...
How do we tokenize:
–foobar ==> ID(foobar) or ID(foo) ID(bar)?
–if ==> ID(if) or IF?

Ambiguous Token Rule Sets
We resolve ambiguities using two conventions:
–Longest match: the regular expression that matches the longest string takes precedence.
–Rule priority: the regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.

Ambiguous Token Rule Sets
Example:
–Identifier tokens: [a-z] [a-z0-9]*
–Sample keyword tokens: if, then, ...
How do we tokenize:
–foobar ==> ID(foobar): use longest match to disambiguate
–if ==> IF: keyword rules have higher priority than the identifier rule
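The two conventions can be sketched as a small matching loop. This Python fragment is illustrative only (it is not how ml-lex works internally; ml-lex compiles rules to automata, as shown later): it tries every rule at the current position, prefers the longest match, and breaks ties by rule order.

```python
import re

# Rules in priority order: keyword rules before the identifier rule.
RULES = [
    ('IF',   re.compile(r'if')),
    ('THEN', re.compile(r'then')),
    ('ID',   re.compile(r'[a-z][a-z0-9]*')),
]

def next_token(text, pos=0):
    """Return (token_name, lexeme) using longest match, then rule priority."""
    best = None
    for index, (name, rx) in enumerate(RULES):
        m = rx.match(text, pos)
        if m is None:
            continue
        # Sort key: longer matches first; among equal lengths, earlier rules win.
        key = (-len(m.group()), index)
        if best is None or key < best[0]:
            best = (key, name, m.group())
    return (best[1], best[2]) if best else None

print(next_token('foobar'))  # ('ID', 'foobar'): longest match
print(next_token('if'))      # ('IF', 'if'): rule priority beats ID(if)
```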

Lexer Implementation
Implementation Options:
1. Write a Lexer from scratch
–Boring and error-prone
2. Use a Lexical Analyzer Generator
–Quick and easy
ml-lex is a lexical analyzer generator for ML; lex and flex are lexical analyzer generators for C.

ML-Lex Specification
A lexical specification consists of 3 parts:

  User Declarations (plain ML types, values, functions)
  %%
  ML-LEX Definitions (RE abbreviations, special stuff)
  %%
  Rules (association of REs with tokens)

(each token will be represented in plain ML)

User Declarations
–The user can define various values that are available to the action fragments.
–Two values must be defined in this section:
  type lexresult: the type of the value returned by each rule action
  fun eof (): called by the lexer when the end of the input stream is reached

ML-LEX Definitions
–The user can define regular expression abbreviations:
  DIGITS = [0-9]+;
  LETTER = [a-zA-Z];
–Define multiple lexers to work together. Each is given a unique name:
  %s LEX1 LEX2 LEX3;

Rules
A rule consists of a pattern and an action:
  regular_expression => (action.code);
–The pattern is a regular expression.
–The action is a fragment of ordinary ML code.
–Longest match & rule priority are used for disambiguation.
Rules may be prefixed with the list of lexers that are allowed to use the rule.

Rules
Rule actions can use any value defined in the User Declarations section, including:
–type lexresult: the type of the value returned by each rule action
–val eof : unit -> lexresult: called by the lexer when the end of the input stream is reached
Special variables:
–yytext: the input substring matched by the regular expression
–yypos: the file position of the beginning of the matched string
–continue (): doesn't return a token; recursively calls the lexer

A Simple Lexer

  datatype token = Num of int | Id of string | IF | THEN | ELSE | EOF
  type lexresult = token      (* mandatory *)
  fun eof () = EOF            (* mandatory *)
  fun itos s =
    case Int.fromString s of
      SOME x => x
    | NONE => raise Fail "impossible"
  %%
  NUM = [1-9][0-9]*;
  ID = [a-zA-Z][a-zA-Z0-9]*;
  %%
  if     => (IF);
  then   => (THEN);
  else   => (ELSE);
  {NUM}  => (Num (itos yytext));
  {ID}   => (Id yytext);

Using Multiple Lexers
–Rules prefixed with a lexer name are matched only when that lexer is executing.
–The initial lexer is called INITIAL.
–Enter a new lexer using: YYBEGIN LEXERNAME;
Aside: it is sometimes useful to process characters but not return any token from the lexer. Use: continue ();

Using Multiple Lexers

  type lexresult = unit       (* mandatory *)
  fun eof () = ()             (* mandatory *)
  %%
  %s COMMENT;
  %%
  <INITIAL> if       => ();
  <INITIAL> [a-z]+   => ();
  <INITIAL> "(*"     => (YYBEGIN COMMENT; continue ());
  <COMMENT> "*)"     => (YYBEGIN INITIAL; continue ());
  <COMMENT> "\n" | . => (continue ());

A (Marginally) More Exciting Lexer

  type lexresult = string                        (* mandatory *)
  fun eof () = (print "End of file\n"; "EOF")    (* mandatory *)
  %%
  %s COMMENT;
  INT = [1-9][0-9]*;
  %%
  <INITIAL> if       => ("IF");
  <INITIAL> then     => ("THEN");
  <INITIAL> {INT}    => ("INT(" ^ yytext ^ ")");
  <INITIAL> "(*"     => (YYBEGIN COMMENT; continue ());
  <COMMENT> "*)"     => (YYBEGIN INITIAL; continue ());
  <COMMENT> "\n" | . => (continue ());

Implementing ML-Lex
By compiling, of course:
–convert REs into non-deterministic finite automata
–convert the non-deterministic finite automata into deterministic finite automata
–convert the deterministic finite automata into a blazingly fast table-driven algorithm
You did mostly everything but possibly the last step in your favorite algorithms class.
–need to deal with disambiguation & rule priority
–need to deal with multiple lexers

Refreshing your memory: RE ==> NDFA ==> DFA
Lex rules:
  if             => (Tok.IF);
  [a-z][a-z0-9]* => (Tok.Id);
NDFA:
–state 1 --i--> state 2 --f--> state 3, which accepts Tok.IF
–state 1 --[a-z]--> state 4, which loops on [a-z0-9] and accepts Tok.Id
DFA (each state is a set of NDFA states):
–{1} --i--> {2,4};  {1} --[a-hj-z]--> {4}
–{2,4} --f--> {3,4};  {2,4} --[a-eg-z0-9]--> {4}
–{3,4} --[a-z0-9]--> {4};  {4} --[a-z0-9]--> {4}
–{3,4} accepts Tok.IF (it could also be Tok.Id; the decision is made by rule priority); {2,4} and {4} accept Tok.Id
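The NDFA-to-DFA step is the classic subset construction. Below is a Python sketch of it for exactly this two-rule automaton (the state numbers and the priority encoding are choices of this sketch, not ml-lex's actual data structures): each DFA state is a set of NDFA states, and an accepting set reports the highest-priority rule it contains.

```python
import string
from collections import deque

LOWER = set(string.ascii_lowercase)
ALNUM = set(string.ascii_lowercase + string.digits)

# NDFA for:  if => Tok.IF   and   [a-z][a-z0-9]* => Tok.Id
# Transitions: state -> list of (character set, target state).
nfa = {
    1: [({'i'}, 2), (LOWER, 4)],
    2: [({'f'}, 3)],
    3: [],
    4: [(ALNUM, 4)],
}
# Accepting NDFA states tagged with (rule index, token); lower index = higher priority.
accept = {3: (0, 'Tok.IF'), 4: (1, 'Tok.Id')}

def subset_construction(start):
    """Build DFA transitions over sets of NDFA states (no epsilon moves here)."""
    start_set = frozenset([start])
    dfa, seen, work = {}, {start_set}, deque([start_set])
    while work:
        S = work.popleft()
        for c in ALNUM:
            T = frozenset(t for s in S for chars, t in nfa[s] if c in chars)
            if not T:
                continue
            dfa[(S, c)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    return dfa, seen

def token_of(S):
    """Rule priority: an accepting DFA state reports its highest-priority rule."""
    hits = [accept[s] for s in S if s in accept]
    return min(hits)[1] if hits else None

dfa, states = subset_construction(1)
print(dfa[(frozenset({1}), 'i')])    # frozenset({2, 4}), as on the slide
print(token_of(frozenset({3, 4})))   # Tok.IF, chosen by rule priority
```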

Table-driven algorithm
DFA (states conveniently renamed: S1 = {1}, S2 = {2,4}, S3 = {3,4}, S4 = {4}):
–S1 --i--> S2;  S1 --[a-hj-z]--> S4
–S2 --f--> S3;  S2 --[a-eg-z0-9]--> S4
–S3 --[a-z0-9]--> S4;  S4 --[a-z0-9]--> S4
–S3 accepts Tok.IF; S2 and S4 accept Tok.Id

Table-driven algorithm
Transition Table:

  State   a-e,g,h,j-z   f    i    0-9
  S1      S4            S4   S2   -
  S2      S4            S3   S4   S4
  S3      S4            S4   S4   S4
  S4      S4            S4   S4   S4

Final State Table:

  S1: -    S2: Tok.Id    S3: Tok.IF    S4: Tok.Id

Algorithm:
–Start in the start state
–Transition from one state to the next using the transition table
–Every time you reach a potential final state, remember it + your position in the stream
–When no more transitions apply, revert to the last final state seen + its position
–Execute the associated rule code
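The algorithm above (run the tables, remember the last accepting state, revert on failure: "maximal munch") can be sketched as follows. This Python fragment hard-codes the S1-S4 tables from the slide; the representation is an illustration, not ml-lex's generated code.

```python
import string

LOWER = string.ascii_lowercase
ALNUM = LOWER + string.digits

def row(*edges):
    """Build one transition-table row from (characters, target) pairs."""
    r = {}
    for chars, target in edges:
        for c in chars:
            r[c] = target
    return r

trans = {
    'S1': row((LOWER.replace('i', ''), 'S4'), ('i', 'S2')),
    'S2': row((ALNUM.replace('f', ''), 'S4'), ('f', 'S3')),
    'S3': row((ALNUM, 'S4')),
    'S4': row((ALNUM, 'S4')),
}
final = {'S2': 'Tok.Id', 'S3': 'Tok.IF', 'S4': 'Tok.Id'}

def next_token(text, pos):
    """Advance through the DFA, remembering the last final state seen;
    when no transition applies, revert to it (longest match)."""
    state, last = 'S1', None
    i = pos
    while i < len(text) and text[i] in trans[state]:
        state = trans[state][text[i]]
        i += 1
        if state in final:
            last = (final[state], i)   # remember rule + position in the stream
    if last is None:
        raise ValueError('lexical error at position %d' % pos)
    return last

print(next_token('if', 0))    # ('Tok.IF', 2)
print(next_token('ifx', 0))   # ('Tok.Id', 3): longest match runs past "if"
```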

Dealing with Multiple Lexers
Lex rules:
  if             => (Tok.IF);
  [a-z][a-z0-9]* => (Tok.Id);
  "(*"           => (YYBEGIN COMMENT; continue ());
  "*)"           => (YYBEGIN INITIAL; continue ());
  .              => (continue ());
(Diagram: the INITIAL rules and the COMMENT rules compile into separate automata; YYBEGIN switches between them.)

Summary
A lexer:
–input: stream of characters
–output: stream of tokens
Writing lexers by hand is boring, so we use a lexer generator: ml-lex
–Lexer generators work by converting REs, via automata theory, into efficient table-driven algorithms.
Moral: don't underestimate your theory classes!
–a great application of cool theory developed in the 70s
–we'll see more cool apps as the course progresses