CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis.

Slides:



Advertisements
Similar presentations
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Advertisements

4b Lexical analysis Finite Automata
Compiler Baojian Hua Lexical Analysis (II) Compiler Baojian Hua
Lexical Analysis Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
Lecture 02 – Lexical Analysis Eran Yahav 1. 2 You are here Executable code exe Source text txt Compiler Lexical Analysis Syntax Analysis Parsing Semantic.
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
Prof. Hilfinger CS 164 Lecture 21 Lexical Analysis Lecture 2-4 Notes by G. Necula, with additions by P. Hilfinger.
Compiler Construction
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Lexical Analysis Recognize tokens and ignore white spaces, comments
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
CS 426 Compiler Construction
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
Topic #3: Lexical Analysis
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
1 Chapter 3 Scanning – Theory and Practice. 2 Overview Formal notations for specifying the precise structure of tokens are necessary –Quoted string in.
1 Outline Informal sketch of lexical analysis –Identifies tokens in input string Issues in lexical analysis –Lookahead –Ambiguities Specifying lexers –Regular.
1 Semantic Analysis Aaron Bloomfield CS 415 Fall 2005.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
Lexical Analysis Mooly Sagiv Schrierber Wed 10:00-12:00 html:// Textbook:Modern.
1 Top Down Parsing. CS 412/413 Spring 2008Introduction to Compilers2 Outline Top-down parsing SLL(1) grammars Transforming a grammar into SLL(1) form.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
Lexical Analyzer (Checker)
Overview of Previous Lesson(s) Over View  An NFA accepts a string if the symbols of the string specify a path from the start to an accepting state.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
Lexical Analysis: Regular Expressions CS 671 January 22, 2008.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
TRANSITION DIAGRAM BASED LEXICAL ANALYZER and FINITE AUTOMATA Class date : 12 August, 2013 Prepared by : Karimgailiu R Panmei Roll no. : 11CS10020 GROUP.
1 November 1, November 1, 2015November 1, 2015November 1, 2015 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
What is a language? An alphabet is a well defined set of characters. The character ∑ is typically used to represent an alphabet. A string : a finite.
1 Languages and Compilers (SProg og Oversættere) Lexical analysis.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
CS375 Compilers Lexical Analysis 4 th February, 2010.
Lexical Analysis – Part I EECS 483 – Lecture 2 University of Michigan Monday, September 11, 2006.
CSc 453 Lexical Analysis (Scanning)
CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 3: Introduction to Syntactic Analysis.
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
Lexical Analysis – Part II EECS 483 – Lecture 3 University of Michigan Wednesday, September 13, 2006.
Lexical Analysis.
1st Phase Lexical Analysis
Prof. Necula CS 164 Lecture 31 Lexical Analysis Lecture 3-4.
Lexical Analysis: Regular Expressions CS 471 September 3, 2007.
1 Topic 2: Lexing and Flexing COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
1 Compiler Construction Vana Doufexi office CS dept.
Deterministic Finite Automata Nondeterministic Finite Automata.
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Department of Software & Media Technology
CS314 – Section 5 Recitation 2
Lecture 2 Lexical Analysis
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Finite-State Machines (FSMs)
CSc 453 Lexical Analysis (Scanning)
RegExps & DFAs CS 536.
Finite-State Machines (FSMs)
Two issues in lexical analysis
Review: Compiler Phases:
Lexical Analysis Lecture 3-4 Prof. Necula CS 164 Lecture 3.
Lecture 4: Lexical Analysis & Chomsky Hierarchy
4b Lexical analysis Finite Automata
Compiler Construction
4b Lexical analysis Finite Automata
Compiler Construction
CSc 453 Lexical Analysis (Scanning)
Presentation transcript:

CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 2 Outline Administration Compilation in a nutshell (or two) What is lexical analysis? Specifying tokens: regular expressions Writing a lexer Writing a lexer generator –Converting regular expressions to Non-deterministic finite automata (NFAs) –NFA to DFA transformation

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 3 Administration Web page has moved:

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 4 Compilation in a Nutshell 1 Source code (character stream) Lexical analysis Parsing Token stream Abstract syntax tree (AST) Semantic Analysis if (b == 0) a = b; if(b)a=b;0== if == b0 = ab if == int b int 0 = int a lvalue int b boolean Decorated AST int ; ;

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 5 Compilation in a Nutshell 2 Intermediate Code Generation Optimization Code generation if == int b int 0 = int a lvalue int b boolean int ; CJUMP == MEM fp8 + CONSTMOVE 0MEM fp4 8 NOP + + CJUMP == CONSTMOVE 0DXCX NOP CX CMP CX, 0 CMOVZ DX,CX

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 6 Lexical Analysis Source code (character stream) Lexical analysis Parsing Token stream Semantic Analysis if (b == 0) a = b; if(b)a=b;0==

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 7 What is Lexical Analysis? Converts character stream to token stream if (x1 * x2<1.0) { y = x1; } if ( x1*x2< 1. 0 ) { \n Keyword: if ( Id: x1 * Id: x2 <Num: 1.0){ Id: y

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 8 Tokens Identifiers: x y11 elsex _i00 Keywords: if else while break Integers: L Floating point: e5 Symbols: + * { } ++ = Strings: “x” “He said, \“Are you?\”” Comments: /** comment **/

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 9 Problems How to describe tokens 2.e0 20.e How to break text up into tokens if (x == 0) a = x<<1; iff (x == 0) a = x<1; How to tokenize efficiently –tokens may have similar prefixes –want to look at each character ~1 time

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 10 How to Describe Tokens Programming language tokens can be described as regular expressions A regular expression R describes some set of strings L(R) –L(abc) = { “abc” } e.g., tokens for floating point numbers, identifiers, etc.

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 11 Regular Expression Notation aordinary character stands for itself  the empty string R|Sany string from either L(R) or L(S) RSstring from L(R) followed by one from L(S) R*zero or more strings from L(R), concatenated  |R|RR|RRR|RRRR …

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 12 RE Shorthand R+one or more strings from L(R): R(R*) R?optional R: (R|  ) [a-z]one character from this range: (a|b|c|d|e|...) [^a-z] one character not from this range

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 13 Examples Regular ExpressionStrings in L(R) a “a” ab “ab” a | b “a” “b” ε“” (ab)*“” “ab” “abab” … (a | ε) b“ab” “b”

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 14 More Practical Examples Regular ExpressionStrings in L(R) digit = [0-9]“0” “1” “2” “3” … posint = digit+“8” “412” … int = -? posint“-42” “1024” … real = int (ε | (. posint))“-1.56” “12” “1.0” [a-zA-Z_][a-zA-Z0-9_]*C, Java identifiers

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 15 How to break up text elsex = 0; Most languages: longest matching token wins Exception: FORTRAN (totally whitespace- insensitive) Implication of longest matching token rule: –automatically convert RE’s into lexer elsex= elsex=

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 16 Scan text one character at a time Use look-ahead character to determine when current token ends while (identifierChar(next)) { token[len++] = next; next = input.read (); } Implementing a lexer else x

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 17 Problems What if it’s a keyword? –Can look up identifier in hash table containing all keywords Verbose, spaghetti code; not easily maintainable or extensible –suppose add new floating point representation (as in C) -- may need to rip up existing tokenizer code

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 18 Solution: Lexer Generator Input to lexer generator: –set of regular expressions –associated action for each RE (generates appropriate kind of token, other bookkeeping). Output: –program that reads an input stream and breaks it up into tokens according to the REs. (Or reports lexical error -- “Unexpected character” ) Examples: lex, flex, JLex

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 19 How does it work? Translate REs into one non-deterministic finite automaton (NFA) Convert NFA into an equivalent DFA Minimize DFA (to reduce # states) Emit code driven by DFA tables Advantage: DFAs efficient to implement –inner loop: look up next state using current state & look-ahead character

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 20 RE  NFA -?[0-9]+ (-|  ) [0-9][0-9]*  — 0,1,2..  NFA: multiple arcs may have same label,  transitions don’t eat input

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 21 NFA  minimized DFA — 0-9 Arcs may not conflict, no  transitions

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 22 DFA  table — other A B CERR BERR CERR CERR C A A B C

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 23 Lexer Generator Output Token nextToken() { state = START; // reset the DFA while (nextChar != EOF) { next = table [state][nextChar]; if (next == START) return new Token(state); state = next; nextChar = input.read( ); } +table +user actions in Token()

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 24 Converting an RE to an NFA a ε ε r r s s If r and s are regular expressions with the NFA’s a

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 25 Inductive Construction r s r s r | s r s ε εε ε r* r ε ε ε ε ε

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 26 Converting NFA to DFA NFA inefficient to implement directly, so convert to a DFA that recognizes the same strings Idea: –NFA can be in multiple states simultaneously –Each DFA state corresponds to a distinct set of NFA states –n-state NFA may be 2 n state DFA in theory, not in practice (construct states lazily & minimize)

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 27 NFA states to DFA states WX  Y 0,1,2.. Z  Possible sets of states: W&X, X, Y&Z DFA states: A B C — — 0-9 A B C

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 28 Handling multiple token REs whitespace identifier number operators keywords      NFADFA

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 29 Running DFA On character not in transition table –If in final state, return token & reset DFA –Otherwise report lexical error

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 30 Summary Lexical analyzer converts a text stream to tokens Tokens defined using regular expressions Regular expressions can be converted to a simple table-driven tokenizer by converting them to NFAs and then to DFAs. Have covered chapters 1-2 from Appel

CS 412/413 Introduction to Compilers and Translators -- Spring '99 Andrew Myers 31 Groups If you haven’t got a full group, hang around after lecture and talk to prospective group members If you find a group, fill in group handout and submit it Send mail to cs412 if you still cannot make a full group Submit questionnaire today in any case