Languages and Compilers (SProg og Oversættere)

Slides:



Advertisements
Similar presentations
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Advertisements

4b Lexical analysis Finite Automata
Formal Language, chapter 4, slide 1Copyright © 2007 by Adam Webber Chapter Four: DFA Applications.
Compiler Baojian Hua Lexical Analysis (II) Compiler Baojian Hua
Lexical Analysis Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
1 The scanning process Main goal: recognize words/tokens Snapshot: At any point in time, the scanner has read some input and is on the way to identifying.
UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering CSCE 531 Compiler Construction Ch.4: Lexical Analysis Spring 2008 Marco Valtorta.
Languages and Compilers (SProg og Oversættere) Lecture 4
1 The scanning process Goal: automate the process Idea: –Start with an RE –Build a DFA How? –We can build a non-deterministic finite automaton (Thompson's.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Scanning with Jflex.
1 Material taught in lecture Scanner specification language: regular expressions Scanner generation using automata theory + extra book-keeping.
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
CS 536 Spring Learning the Tools: JLex Lecture 6.
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
1 Languages and Compilers (SProg og Oversættere) Lecture 4 Bent Thomsen Department of Computer Science Aalborg University With acknowledgement to Norm.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
D. M. Akbar Hussain: Department of Software & Media Technology 1 Compiler is tool: which translate notations from one system to another, usually from source.
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.
1 Languages and Compilers (SProg og Oversættere) Bent Thomsen Department of Computer Science Aalborg University With acknowledgement to Norm Hutchinson.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
CSE 5317/4305 L2: Lexical Analysis1 Lexical Analysis Leonidas Fegaras.
TRANSITION DIAGRAM BASED LEXICAL ANALYZER and FINITE AUTOMATA Class date : 12 August, 2013 Prepared by : Karimgailiu R Panmei Roll no. : 11CS10020 GROUP.
1 November 1, November 1, 2015November 1, 2015November 1, 2015 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
1 Languages and Compilers (SProg og Oversættere) Lexical analysis.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
1 Languages and Compilers (SProg og Oversættere) Bent Thomsen Department of Computer Science Aalborg University With acknowledgement to Norm Hutchinson.
Introduction to Lex Ying-Hung Jiang
CSc 453 Lexical Analysis (Scanning)
1 Languages and Compilers (SProg og Oversættere) Lecture 4 Bent Thomsen Department of Computer Science Aalborg University With acknowledgement to Norm.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
C Chuen-Liang Chen, NTUCS&IE / 35 SCANNING Chuen-Liang Chen Department of Computer Science and Information Engineering National Taiwan University Taipei,
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
Lexical Analysis – Part II EECS 483 – Lecture 3 University of Michigan Wednesday, September 13, 2006.
1 Languages and Compilers (SProg og Oversættere) Bent Thomsen Department of Computer Science Aalborg University With acknowledgement to Norm Hutchinson.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Department of Software & Media Technology
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
Lecture 2 Lexical Analysis
Chapter 3 Lexical Analysis.
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
CSc 453 Lexical Analysis (Scanning)
Finite-State Machines (FSMs)
Lexical analysis Finite Automata
CSc 453 Lexical Analysis (Scanning)
RegExps & DFAs CS 536.
Finite-State Machines (FSMs)
Compiler Lecture 1 CS510.
JLex Lecture 4 Mon, Jan 26, 2004.
Recognizer for a Language
CPSC 388 – Compiler Design and Construction
Lecture 5: Lexical Analysis III: The final bits
CS 3304 Comparative Languages
4b Lexical analysis Finite Automata
CS 3304 Comparative Languages
4b Lexical analysis Finite Automata
Lexical Analysis - An Introduction
Lexical Analysis.
Compiler Construction
Lecture 5 Scanning.
CSc 453 Lexical Analysis (Scanning)
Presentation transcript:

Languages and Compilers (SProg og Oversættere) Bent Thomsen Department of Computer Science Aalborg University With acknowledgement to Norm Hutchinson who’s slides this lecture is based on.

Generation of Scanners and Parsers Why do we talk about this stuff? From the material covered thus far it is obvious that writing parsers and scanners is a rather “robotic” activity. => Many code generation tools exist to automate the process. => Few people actually write parsers/scanners manually We already have a reasonable idea about generation of recursive descent parsers (LL(k)), and we have looked at the JavaCC LL(k) parser generator, but the most widely used parser generators generate bottom up parsers or LR. We will look at JavaCUP and SableCC. It is useful to know at least a little bit more about the underlying machinery and theory to be able to use such very practical tools effectively!

Generating Scanners Generation of scanners is based on Regular Expressions: to describe the tokens to be recognized Finite State Machines: an execution model to which RE’s are “compiled” Recap: Regular Expressions e The empty string t Generates only the string t X Y Generates any string xy such that x is generated by x and y is generated by Y X | Y Generates any string which generated either by X or by Y X* The concatenation of zero or more strings generated by X (X) For grouping

Generating Scanners Regular Expressions can be recognized by a finite state machine. (often used synonyms: finite automaton (acronym FA)) Definition: A finite state machine is an N-tuple (States,S,start,d ,End) States A finite set of “states” S An “alphabet”: a finite set of symbols from which the strings we want to recognize are formed (for example: the ASCII char set) start A “start state” Start  States d Transition relation d  States x States x S. These are “arrows” between states labeled by a letter from the alphabet. End A set of final states. End  States

Generating Scanners Finite state machine: the easiest way to describe a Finite State Machine is by means of a picture: Example: an FA that recognizes M r | M s = initial state r = final state M = non-final state M s

Deterministic, and non-deterministic DFA A FA is called deterministic (acronym: DFA) if for every state and every possible input symbol, there is only one possible transition to chose from. Otherwise it is called non-deterministic (NDFA). Q: Is this FSM deterministic or non-deterministic: r M M s

Deterministic, and non-deterministic FA Theorem: every NDFA can be converted into an equivalent DFA. M r s M r s DFA ?

Deterministic, and non-deterministic FA Theorem: every NDFA can be converted into an equivalent DFA. Algorithm: The basic idea: DFA is defined as a machine that does a “parallel simulation” of the NDFA. The states of the DFA are subsets of the states of the NDFA (i.e. every state of the DFA is a set of states of the NDFA) => This state can be interpreted as meaning “the simulated DFA is now in any of these states”

Deterministic, and non-deterministic FA Conversion algorithm example: r M 2 3 M 1 r {3,4} is a final state because 3 is a final state 4 r r,s r {3,4} {2,4} --r-->{3,4} because: 2 --r--> 3 4 --r--> 3 4 --r--> 4 M r s s {1} {4} {2,4} s

FA with e moves (N)DFA-e automata are like (N)DFA. In an (N)DFA-e we are allowed to have transitions which are “e-moves”. Example: M r (M r)* M r e Theorem: every (N)DFA-e can be converted into an equivalent NDFA (without e-moves). M r r M

FA with e moves Theorem: every (N)DFA-e can be converted into an equivalent NDFA (without e-moves). Algorithm: 1) converting states into final states: if a final state can be reached from a state S using an e-transition convert it into a final state. convert into a final state e Repeat this rule until no more states can be converted. For example: convert into a final state e e 2 1

FA with e moves Algorithm: 1) converting states into final states. 2) adding transitions (repeat until no more can be added) a) for every transition followed by e-transition t e add transition t b) for every transition preceded by e-transition e t add transition t 3) delete all e-transitions

FA and the implementation of Scanners Regular expressions, (N)DFA-e and NDFA and DFA’s are all equivalent formalism in terms of what languages can be defined with them. Regular expressions are a convenient notation for describing the “tokens” of programming languages. Regular expressions can be converted into FA’s (the algorithm for conversion into NDFA-e is straightforward) DFA’s can be easily implemented as computer programs. will explain this in subsequent slides

FA and the implementation of Scanners What a typical scanner generator does: Scanner Generator Scanner DFA Java or C or ... Token definitions Regular expressions note: In practice this exact algorithm is not used. For reasons of performance, sophisticated optimizations are used. direct conversion from RE to DFA minimizing the DFA A possible algorithm: - Convert RE into NDFA-e - Convert NDFA-e into NDFA - Convert NDFA into DFA - generate Java/C/... code

Converting a RE into an NDFA-e RE: e FA: RE: t FA: t RE: XY FA: e X Y

Converting a RE into an NDFA-e RE: X|Y FA: e X Y e RE: X* FA: X

Implementing a DFA Definition: A finite state machine is an N-tuple (States,S,start,d ,End) States N different states => integers {0,..,N-1} => int data type S byte or char data type. start An integer number d Transition relation d  States x S x States. For a DFA this is a function States x S -> States Represented by a two dimensional array (one dimension for the current state, another for the current character. The contents of the array is the next state. End A set of final states. Represented (for example) by an array of booleans (mark final state by true and other states by false)

Implementing a DFA public class Recognizer { static boolean[] finalState = final state table ; static int[][] delta = transition table ; private byte currentCharCode = get first char ; private int currentState = start state ; public boolean recognize() { while (currentCharCode is not end of file) && (currentState is not error state ) { currentState = delta[currentState][currentCharCode]; currentCharCode = get next char ; return finalState[currentState]; }

Implementing a Scanner as a DFA Slightly different from previously shown implementation (but similar in spirit): Not the goal to match entire input => when to stop matching? Match longest possible token before reaching error state. How to identify matched token class (not just true|false) Final state determines matched token class

Implementing a Scanner as a DFA public class Scanner { static int[] matchedToken = maps state to token class static int[][] delta = transition table ; private byte currentCharCode = get first char ; private int currentState = start state ; private int tokbegin = begining of current token ; private int tokend = end of current token private int tokenKind ; ...

Implementing a Scanner as a DFA public Token scan() { skip separator (implemented as DFA as well) tokbegin = current source position tokenKind = error code while (currentState is not error state ) { if (currentState is final state ) { tokend = current source location ; tokenKind = matchedToken[currentState]; currentState = delta[currentState][currentCharCode]; currentCharCode = get next source char ; } if (tokenKind == error code ) report lexical error ; move current source position to tokend return new Token(tokenKind, source chars from tokbegin to tokend-1 );

We don’t do this by hand anymore! Writing scanners is a rather “robotic” activity which can be automated. JLex input: a set of r.e.’s and action code output a fast lexical analyzer (scanner) based on a DFA Or the lexer is built into the parser generator as in JavaCC

JLex Lexical Analyzer Generator for Java We will look at an example JLex specification (adopted from the manual). Consult the manual for details on how to write your own JLex specifications. Definition of tokens Regular Expressions JLex Java File: Scanner Class Recognizes Tokens

The JLex tool Layout of JLex file: user code (added to start of generated file) %%   options %{ user code (added inside the scanner class declaration) %} macro definitions lexical declaration User code is copied directly into the output class JLex directives allow you to include code in the lexical analysis class, change names of various components, switch on character counting, line counting, manage EOF, etc. Macro definitions gives names for useful regexps Regular expression rules define the tokens to be recognised and actions to be taken

The JLex tool: Example An example: import java_cup.runtime.*; %%   %class Lexer %unicode %cup %line %column %state STRING  %{ ...

The JLex tool %state STRING %{   %{ StringBuffer string = new StringBuffer(); private Symbol symbol(int type) { return new Symbol(type, yyline, yycolumn); } private Symbol symbol(int type, Object value) { return new Symbol(type, yyline, yycolumn, value); %} ...

The JLex tool %} LineTerminator = \r|\n|\r\n InputCharacter = [^\r\n] WhiteSpace = {LineTerminator} | [ \t\f] /* comments */ Comment = {TraditionalComment} | {EndOfLineComment} | TraditionalComment = "/*" {CommentContent} "*"+ "/" EndOfLineComment= "//"{InputCharacter}* {LineTerminator} CommentContent = ( [^*] | \*+ [^/*] )* Identifier = [:jletter:] [:jletterdigit:]* DecIntegerLiteral = 0 | [1-9][0-9]* %% ...

The JLex tool ... %% <YYINITIAL> "abstract" { return symbol(sym.ABSTRACT); } <YYINITIAL> "boolean" { return symbol(sym.BOOLEAN); } <YYINITIAL> "break" { return symbol(sym.BREAK); }   <YYINITIAL> { /* identifiers */ {Identifier} { return symbol(sym.IDENTIFIER);} /* literals */ {DecIntegerLiteral} { return symbol(sym.INT_LITERAL);}

The JLex tool ... /* literals */ {DecIntegerLiteral} { return symbol(sym.INT_LITERAL);} \" { string.setLength(0); yybegin(STRING); } /* operators */ "=" { return symbol(sym.EQ); } "==" { return symbol(sym.EQEQ); } "+" { return symbol(sym.PLUS); } /* comments */ {Comment} { /* ignore */ } /* whitespace */ {WhiteSpace} { /* ignore */ } }

The JLex tool ... <STRING> { \" { yybegin(YYINITIAL); return symbol(sym.STRINGLITERAL, string.toString()); } [^\n\r\"\]+ { string.append( yytext() ); } \\t { string.append('\t'); } \\n { string.append('\n'); } \\r { string.append('\r'); } \\" { string.append('\"'); } \\ { string.append('\'); } }

JLex generated Lexical Analyser Class Yylex Name can be changed with %class directive Default construction with one arg – the input stream You can add your own constructors The method performing lexical analysis is yylex() Public Yytoken yylex() which return the next token You can change the name of yylex() with %function directive String yytext() returns the matched token string Int yylenght() returns the length of the token Int yychar is the index of the first matched char (if %char used) Class Yytoken Returned by yylex() – you declare it or supply one already defined You can supply one with %type directive Java_cup.runtime.Symbol is useful Actions typically written to return Yytoken(…)

Conlusions Don’t worry too much about DFAs You do need to understand how to specify regular expressions Note that different tools have different notations for regular expressions. You would probably only need to use JLex (Lex) if you use also use CUP (or Yacc or SML-Yacc)