Languages and Compilers (SProg og Oversættere): Lexical Analysis


1 Languages and Compilers (SProg og Oversættere) Lexical analysis

2
– Describe the role of the lexical analysis phase
– Describe regular expressions and finite automata

3 Syntax Analysis
(Dataflow chart: Source Program → stream of characters → Scanner → stream of “Tokens” → Parser → Abstract Syntax Tree; both Scanner and Parser produce Error Reports.)

4 1) Scan: Divide Input into Tokens
An example mini-Triangle source program:

    let var y: Integer
      in !new year
         y := y+1

The scanner turns this stream of characters into the stream of tokens:

    let   var   ident. y   colon :   ident. Integer   in   ident. y   becomes :=   ...   ident. y   op. +   intlit 1   eot

Tokens are “words” in the input, for example keywords, operators, identifiers, literals, etc.

5 Steps for Developing a Scanner
1) Express the “lexical” grammar in EBNF (do necessary transformations)
2) Implement the scanner based on this grammar
3) Refine the scanner to keep track of the spelling and kind of the currently scanned token
Thus developing a scanner is a mechanical task.

6 Developing a Scanner
Express the “lexical” grammar in EBNF:

    Token           ::= Identifier | Integer-Literal | Operator | ; | : | := | ~ | ( | ) | eot
    Identifier      ::= Letter (Letter | Digit)*
    Integer-Literal ::= Digit Digit*
    Operator        ::= + | - | * | / | < | > | =
    Separator       ::= Comment | space | eol
    Comment         ::= ! Graphic* eol

Now perform substitution and left factorization:

    Token     ::= Letter (Letter | Digit)* | Digit Digit*
                | + | - | * | / | < | > | = | ; | : (= | ε) | ~ | ( | ) | eot
    Separator ::= ! Graphic* eol | space | eol

7 Developing a Scanner

    Token ::= Letter (Letter | Digit)* | Digit Digit*
            | + | - | * | / | < | > | = | ; | : (= | ε) | ~ | ( | ) | eot

    private byte scanToken() {
      switch (currentChar) {
        case ‘a’: case ‘b’: ... case ‘z’:
        case ‘A’: case ‘B’: ... case ‘Z’:
          scan Letter (Letter | Digit)*
          return Token.IDENTIFIER;
        case ‘0’: ... case ‘9’:
          scan Digit Digit*
          return Token.INTLITERAL;
        case ‘+’: case ‘-’: ... : case ‘=’:
          takeIt();
          return Token.OPERATOR;
        ...etc...
      }
    }

8 Developing a Scanner
Let’s look at the identifier case in more detail. Starting from

    case ‘a’: case ‘b’: ... case ‘z’:
    case ‘A’: case ‘B’: ... case ‘Z’:
      scan Letter (Letter | Digit)*
      return Token.IDENTIFIER;

the pseudo-code is refined step by step:

    scan Letter (Letter | Digit)*

    scan Letter
    scan (Letter | Digit)*

    acceptIt();
    scan (Letter | Digit)*

    acceptIt();
    while (isLetter(currentChar) || isDigit(currentChar))
      scan (Letter | Digit)

    acceptIt();
    while (isLetter(currentChar) || isDigit(currentChar))
      acceptIt();

Thus developing a scanner is a mechanical task.
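Assembling the refined cases into one piece, a minimal self-contained sketch of such a hand-written scanner could look as follows. The class name, token-kind constants and helper methods below are illustrative, not the actual Triangle tools; acceptIt() is used throughout, and only a subset of the token grammar (no comments, no all separators) is handled.

    // Sketch: hand-written scanner for (part of) the left-factored Token grammar.
    class MiniScanner {
      static final byte IDENTIFIER = 0, INTLITERAL = 1, OPERATOR = 2,
                        COLON = 3, BECOMES = 4, SEMICOLON = 5, EOT = 6, ERROR = 7;

      private final String source;
      private int pos = 0;
      private char currentChar;
      private final StringBuilder spelling = new StringBuilder(); // spelling of the current token

      MiniScanner(String source) { this.source = source; currentChar = nextChar(); }

      private char nextChar() { return pos < source.length() ? source.charAt(pos++) : '\u0000'; }
      private void acceptIt() { spelling.append(currentChar); currentChar = nextChar(); }
      private static boolean isLetter(char c) { return Character.isLetter(c); }
      private static boolean isDigit(char c)  { return Character.isDigit(c); }

      byte scanToken() {
        while (currentChar == ' ' || currentChar == '\n') acceptIt();   // skip separators
        spelling.setLength(0);                                          // start a new spelling
        if (isLetter(currentChar)) {                  // Identifier ::= Letter (Letter | Digit)*
          acceptIt();
          while (isLetter(currentChar) || isDigit(currentChar)) acceptIt();
          return IDENTIFIER;
        }
        if (isDigit(currentChar)) {                   // Integer-Literal ::= Digit Digit*
          acceptIt();
          while (isDigit(currentChar)) acceptIt();
          return INTLITERAL;
        }
        switch (currentChar) {
          case '+': case '-': case '*': case '/': case '<': case '>': case '=':
            acceptIt(); return OPERATOR;
          case ':':                                   // ':' optionally followed by '=' (left factorization)
            acceptIt();
            if (currentChar == '=') { acceptIt(); return BECOMES; }
            return COLON;
          case ';': acceptIt(); return SEMICOLON;
          case '\u0000': return EOT;
          default:  acceptIt(); return ERROR;         // unexpected character
        }
      }

      String spelling() { return spelling.toString(); }

      public static void main(String[] args) {
        MiniScanner s = new MiniScanner("y := y+1");
        for (byte k = s.scanToken(); k != EOT; k = s.scanToken())
          System.out.println(k + " \"" + s.spelling() + "\"");
      }
    }

Running main scans y := y+1 into IDENTIFIER, BECOMES, IDENTIFIER, OPERATOR and INTLITERAL tokens, each with its spelling, which is exactly step 3 of the recipe on slide 5.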

9 Generating Scanners
Generation of scanners is based on
– Regular Expressions: to describe the tokens to be recognized
– Finite State Machines: an execution model to which REs are “compiled”

Recap: Regular Expressions
    ε       The empty string
    t       Generates only the string t
    X Y     Generates any string xy such that x is generated by X and y is generated by Y
    X | Y   Generates any string which is generated either by X or by Y
    X*      The concatenation of zero or more strings generated by X
    (X)     For grouping
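As an aside (not on the original slides), the token REs from slide 6 map directly onto java.util.regex notation; a scanner generator compiles such REs into a DFA rather than interpreting them at run time, but the correspondence is one-to-one:

    import java.util.regex.Pattern;

    // Sketch: the mini-Triangle token REs written as Java regular expressions.
    class TokenPatterns {
      static final Pattern IDENTIFIER  = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"); // Letter (Letter | Digit)*
      static final Pattern INT_LITERAL = Pattern.compile("[0-9]+");               // Digit Digit*
      static final Pattern OPERATOR    = Pattern.compile("[-+*/<>=]");            // + | - | * | / | < | > | =

      public static void main(String[] args) {
        System.out.println(IDENTIFIER.matcher("y2").matches());   // true
        System.out.println(INT_LITERAL.matcher("10").matches());  // true
        System.out.println(IDENTIFIER.matcher("2y").matches());   // false
      }
    }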

10 Generating Scanners
Regular Expressions can be recognized by a finite state machine (often used synonym: finite automaton, acronym FA).
Definition: A finite state machine is an N-tuple (States, Σ, start, δ, End)
    States  A finite set of “states”
    Σ       An “alphabet”: a finite set of symbols from which the strings we want to recognize are formed (for example: the ASCII char set)
    start   A “start state”, start ∈ States
    δ       Transition relation, δ ⊆ States × States × Σ. These are “arrows” between states labeled by a letter from the alphabet.
    End     A set of final states, End ⊆ States

11 Deterministic and non-deterministic FA
An FA is called deterministic (acronym: DFA) if for every state and every possible input symbol there is only one possible transition to choose from. Otherwise it is called non-deterministic (NDFA).
(Diagram: an FA whose start state has two outgoing transitions both labeled M, followed by transitions labeled r and s.)
Q: Is this FSM deterministic or non-deterministic?

12 Deterministic and non-deterministic FA
Theorem: every NDFA can be converted into an equivalent DFA.
(Diagram: the NDFA from the previous slide, with its two M-transitions followed by r and s, next to the question of what the equivalent DFA with a single M-transition looks like.)
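The conversion behind this theorem is the classical subset (powerset) construction: each DFA state is a set of NDFA states. Below is a minimal sketch in Java, assuming an NFA without ε-transitions given as a map-based transition relation; the type layout and names are illustrative, not a fixed API.

    import java.util.*;

    // Sketch: subset construction, turning an NFA (set-valued transitions) into a DFA.
    class SubsetConstruction {
      // nfa.get(state).get(symbol) = set of possible next NFA states (may be missing).
      static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
          Map<Integer, Map<Character, Set<Integer>>> nfa, int nfaStart, Set<Character> alphabet) {

        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(Set.of(nfaStart));                    // the DFA start state {nfaStart}

        while (!work.isEmpty()) {
          Set<Integer> current = work.pop();
          if (dfa.containsKey(current)) continue;       // already processed this DFA state
          Map<Character, Set<Integer>> row = new HashMap<>();
          dfa.put(current, row);
          for (char c : alphabet) {
            Set<Integer> next = new HashSet<>();
            for (int s : current)                       // union of all NFA moves on c
              next.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(c, Set.of()));
            if (!next.isEmpty()) {
              row.put(c, next);
              work.push(next);                          // possibly a new DFA state to explore
            }
          }
        }
        return dfa;  // a DFA state is final iff it contains an NFA final state
      }
    }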

13 FA and the implementation of Scanners
What a typical scanner generator does:
    Token definitions (regular expressions) → Scanner Generator → Scanner (a DFA, emitted as Java or C or ... code)
A possible algorithm:
– Convert the REs into an NDFA-ε
– Convert the NDFA-ε into an NDFA
– Convert the NDFA into a DFA
– Generate Java/C/... code
Note: in practice this exact algorithm is not used. For reasons of performance, sophisticated optimizations are used:
– direct conversion from RE to DFA
– minimizing the DFA

14 Converting an RE into an NDFA-ε
(Diagrams of the standard construction:)
    RE: ε    FA: a single ε-transition from the start state to a final state
    RE: t    FA: a single transition labeled t from the start state to a final state
    RE: XY   FA: the machine for X followed by the machine for Y, joined by an ε-transition

15 Converting an RE into an NDFA-ε
    RE: X|Y  FA: a new start state with ε-transitions into the machines for X and Y, and ε-transitions from both of their final states into a common new final state
    RE: X*   FA: the machine for X wrapped with ε-transitions that allow it to be skipped or repeated
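These constructions can be phrased directly as code. Below is a minimal sketch, assuming an NDFA-ε fragment is represented as a start state, a final state and a list of labeled edges; the class names and the ε-marker are illustrative, not a standard library.

    import java.util.*;

    // Sketch: Thompson-style construction of NDFA-ε fragments for the RE cases above.
    // Each fragment has exactly one start and one final state; EPS marks an ε-transition.
    class ThompsonSketch {
      static final char EPS = '\0';                      // illustrative marker for ε
      static int nextState = 0;
      static int fresh() { return nextState++; }

      record Edge(int from, char label, int to) {}
      record Frag(int start, int end, List<Edge> edges) {}

      // RE: t  (a single symbol)
      static Frag symbol(char t) {
        int s = fresh(), e = fresh();
        return new Frag(s, e, new ArrayList<>(List.of(new Edge(s, t, e))));
      }

      // RE: XY (sequence): link X's final state to Y's start state with ε
      static Frag seq(Frag x, Frag y) {
        List<Edge> edges = new ArrayList<>(x.edges());
        edges.addAll(y.edges());
        edges.add(new Edge(x.end(), EPS, y.start()));
        return new Frag(x.start(), y.end(), edges);
      }

      // RE: X|Y (alternation): new start/final states with ε-edges around X and Y
      static Frag alt(Frag x, Frag y) {
        int s = fresh(), e = fresh();
        List<Edge> edges = new ArrayList<>(x.edges());
        edges.addAll(y.edges());
        edges.addAll(List.of(new Edge(s, EPS, x.start()), new Edge(s, EPS, y.start()),
                             new Edge(x.end(), EPS, e), new Edge(y.end(), EPS, e)));
        return new Frag(s, e, edges);
      }

      // RE: X* (repetition): ε-edges allow skipping X or running it again
      static Frag star(Frag x) {
        int s = fresh(), e = fresh();
        List<Edge> edges = new ArrayList<>(x.edges());
        edges.addAll(List.of(new Edge(s, EPS, x.start()), new Edge(s, EPS, e),
                             new Edge(x.end(), EPS, x.start()), new Edge(x.end(), EPS, e)));
        return new Frag(s, e, edges);
      }
    }

For example, star(seq(symbol('a'), symbol('b'))) builds the NDFA-ε for (ab)*, which can then be fed to the subset construction from slide 12 after ε-elimination.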

16 Implementing a DFA
Definition: A finite state machine is an N-tuple (States, Σ, start, δ, End)
    States  N different states => integers {0,..,N-1} => int data type
    Σ       byte or char data type
    start   An integer number
    δ       Transition relation δ ⊆ States × Σ × States. For a DFA this is a function States × Σ -> States, represented by a two-dimensional array (one dimension for the current state, another for the current character); the contents of the array is the next state.
    End     A set of final states, represented (for example) by an array of booleans (mark final states by true and other states by false)
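For example, a table-driven DFA recognizing Identifier ::= Letter (Letter | Digit)* could be implemented as follows. This is a sketch that assumes ASCII input (char codes below 128); the state numbering and array layout are chosen purely for illustration.

    import java.util.Arrays;

    // Sketch: table-driven DFA for Identifier ::= Letter (Letter | Digit)*.
    // States: 0 = start, 1 = inside an identifier (final), 2 = dead state.
    class IdentifierDfa {
      static final int START = 0, IN_ID = 1, DEAD = 2;
      static final int[][] delta = new int[3][128];      // delta[state][asciiChar] = next state
      static final boolean[] isFinal = { false, true, false };

      static {
        for (int[] row : delta) Arrays.fill(row, DEAD);  // default: no transition
        for (char c = 'a'; c <= 'z'; c++) { delta[START][c] = IN_ID; delta[IN_ID][c] = IN_ID; }
        for (char c = 'A'; c <= 'Z'; c++) { delta[START][c] = IN_ID; delta[IN_ID][c] = IN_ID; }
        for (char c = '0'; c <= '9'; c++) { delta[IN_ID][c] = IN_ID; }
      }

      static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray())
          state = delta[state][c];                       // one table lookup per character
        return isFinal[state];
      }

      public static void main(String[] args) {
        System.out.println(accepts("y2"));   // true
        System.out.println(accepts("2y"));   // false
      }
    }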

17 We don’t do this by hand anymore!
Writing scanners is a rather “robotic” activity which can be automated.
JLex:
– input: a set of r.e.’s and action code
– output: a fast lexical analyzer (scanner)
– based on a DFA
Or the lexer is built into the parser generator, as in JavaCC.

18 JLex: Lexical Analyzer Generator for Java
(Diagram: Definition of tokens as Regular Expressions → JLex → Java file containing a Scanner class that recognizes the tokens.)