1 Chapter 3 Scanning – Theory and Practice. 2 Overview Formal notations for specifying the precise structure of tokens are necessary –Quoted string in.

Slides:



Advertisements
Similar presentations
4b Lexical analysis Finite Automata
Advertisements

COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
From Cooper & Torczon1 The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source language?
Regular Expressions Finite State Automaton. Programming Languages2 Regular expressions  Terminology on Formal languages: –alphabet : a finite set of.
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
1 The scanning process Main goal: recognize words/tokens Snapshot: At any point in time, the scanner has read some input and is on the way to identifying.
From Cooper & Torczon1 Automating Scanner Construction RE  NFA ( Thompson’s construction )  Build an NFA for each term Combine them with  -moves NFA.
1 The scanning process Goal: automate the process Idea: –Start with an RE –Build a DFA How? –We can build a non-deterministic finite automaton (Thompson's.
Scanner 中正理工學院 電算中心副教授 許良全. Copyright © 1998 by LCH Compiler Design Overview of Scanning n The purpose of a scanner is to group input characters into.
2. Lexical Analysis Prof. O. Nierstrasz
College of Computer Science & Technology Compiler Construction Principles & Implementation Techniques -1- Compiler Construction Principles & Implementation.
CS5371 Theory of Computation Lecture 4: Automata Theory II (DFA = NFA, Regular Language)
1 Chapter 3 Scanning – Theory and Practice. 2 Overview Formal notations for specifying the precise structure of tokens are necessary  Quoted string in.
1.Defs. a)Finite Automaton: A Finite Automaton ( FA ) has finite set of ‘states’ ( Q={q 0, q 1, q 2, ….. ) and its ‘control’ moves from state to state.
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
Scanner Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source language? Is the.
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
Regular Languages A language is regular over  if it can be built from ;, {  }, and { a } for every a 2 , using operators union ( [ ), concatenation.
Formal Language Finite set of alphabets Σ: e.g., {0, 1}, {a, b, c}, { ‘{‘, ‘}’ } Language L is a subset of strings on Σ, e.g., {00, 110, 01} a finite language,
Finite-State Machines with No Output Longin Jan Latecki Temple University Based on Slides by Elsa L Gunter, NJIT, and by Costas Busch Costas Busch.
Finite-State Machines with No Output
1 Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
1 Chapter 3 Scanning - Theory and Practice Prof Chung. 10/8/2015.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
Lexical Analyzer (Checker)
1 Chapter 3 Scanning – Theory and Practice. 2 Overview of scanner A scanner transforms a character stream of source file into a token stream. It is also.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
CS 321 Programming Languages and Compilers Lectures 16 & 17 Introduction to Formal Languages Regular Languages Lexical Analysis.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
CMSC 330: Organization of Programming Languages Theory of Regular Expressions Finite Automata.
CSE 425: Syntax I Syntax and Semantics Syntax gives the structure of statements in a language –Allowed ordering, nesting, repetition, omission of symbols.
C Chuen-Liang Chen, NTUCS&IE / 35 SCANNING Chuen-Liang Chen Department of Computer Science and Information Engineering National Taiwan University Taipei,
Donghyun (David) Kim Department of Mathematics and Physics North Carolina Central University 1 Chapter 1 Regular Languages Some slides are in courtesy.
Modeling Computation: Finite State Machines without Output
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis.
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
using Deterministic Finite Automata & Nondeterministic Finite Automata
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
Deterministic Finite Automata Nondeterministic Finite Automata.
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
June 13, 2016 Prof. Abdelaziz Khamis 1 Chapter 2 Scanning – Part 2.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
1 Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
Department of Software & Media Technology
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
Lecture 2 Lexical Analysis
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Lexical Analysis (Sections )
Lexical analysis Finite Automata
Two issues in lexical analysis
Recognizer for a Language
Chapter 2 FINITE AUTOMATA.
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
4b Lexical analysis Finite Automata
Finite Automata & Language Theory
4b Lexical analysis Finite Automata
Instructor: Aaron Roth
Chapter 1 Regular Language
Lecture 5 Scanning.
Lexical Analysis Uses formalism of Regular Languages
Presentation transcript:

1 Chapter 3 Scanning – Theory and Practice

2 Overview Formal notations for specifying the precise structure of tokens are necessary –Quoted string in Pascal –Can a string split across a line? –Is a null string allowed? –Is.1 or 10. ok? –the problem Scanner generators –tables –Programs What formal notations to use?

3 Regular Expressions Tokens are built from symbols of a finite vocabulary. We use regular expressions to define structures of tokens.

4 Regular Expressions The sets of strings defined by regular expressions are termed regular sets Definition of regular expressions –  is a regular expression denoting the empty set – is a regular expression denoting the set that contains only the empty string –A string s is a regular expression denoting a set containing only s –if A and B are regular expressions, so are A | B (alternation) AB (concatenation) A* (Kleene closure)

5 Regular Expressions (Cont’d) some notational convenience P + == PP* Not(A) == V - A Not(S) == V* - S A K == AA …A (k copies)

6 Regular Expressions (Cont’d) Some examples Let D = (0 | 1 | 2 | 3 | 4 |... | 9 ) L = (A | B |... | Z) comment = -- not(EOL)* EOL decimal = D + · D + ident = L (L | D)* (_ (L | D) + )* comments = ##((#| )not(#))* ##

7 Regular Expressions (Cont’d) Is regular expression as power as CFG? { [ i ] i | i  1}

8 R Scanner Generator P P A string S Yes No S  L( R ) S  L( R ) Given a regular expression Generated program

9 Finite Automata and Scanners A finite automaton (FA) can be used to recognize the tokens specified by a regular expression A FA consists of –A finite set of states –A set of transitions (or moves) from one state to another, labeled with characters in V –A special start state –A set of final, or accepting, states

10 A transition diagram This machine accepts abccabc, but it rejects abcab. This machine accepts (abc + ) +.

11 Example (0|1)*0(0|1)(0|1) ,1 0,1 0,1 0 0,1

12 Example RealLit = (D + ( |.))|(D*.D + )

13 Example ID = L(L|D)*(_(L|D) + )*

14 Finite Automata and Scanners Two kinds of FA: –Deterministic: next transition is unique DFA –Non-deterministic: otherwise NFA... a a

15 A transition table of a DFA

16 Finite Automata and Scanners Any regular expression can be translated into a DFA that accepts the set of strings denoted by the regular expression The transition can be done –Automatically by a scanner generator –Manually by a programmer

17

18

19 Finite Automata and Scanners Transducer –We may perform some actions during state transition.

20

21 Practical Consideration Reserved Words –Usually, all keywords are reserved in order to simplify parsing. –Assume that in Pascal begin and end are not reserved begin begin; end; end; begin; end if if than else = then; The problem with reserved words is that they are too numerous. –COBOL has several hundred reserved words!

Practical Consideration (Cont’d) The fact that reserved words look just like identifiers complicates the specification of a scanner via regular expressions –Using “NOT” Not(Not(Id)|begin|end|…) –Just write down the regular expression directly L|(LL)|((LLL)L + )|((L-′E′)L * | (L(L-′N′)L * )|(LL(L-′D′)L * ) –Treat reserved words as ordinary identifiers –Create distinct regular expressions for each reserved word 22

23 Practical Consideration (Cont’d) Compiler Directives and Listing Source Lines –Compiler options e.g. optimization, profiling, etc. handled by scanner or semantic routines Complex pragmas are treated like other statements. –Source inclusion e.g. #include in C handled by preprocessor or scanner –Conditional compilation e.g. #if, #endif in C useful for creating program versions

24 Practical Consideration (Cont’d) Entry of Identifiers into the Symbol Table Who is responsible for entering symbols into symbol table? –Scanner? –Consider this example: { int abc; … { int abc; } }

25 Practical Consideration (Cont’d) How to handle end-of-file? – Create a special EOF token. EOF token is useful in a CFG Multicharacter Lookahead –Blanks are not significant in Fortran DO 10 I = 1,100 –Beginning of a loop DO 10 I = –An assignment statement DO10I=1.100 A Fortran scanner can determine whether the O is the last character of a DO token only after reading as far as the comma

26 Practical Consideration (Cont’d) Multicharacter Lookahead (Cont’d) –In Ada and Pascal To scan –There are three token »1 ».. »100 –Two-character lookahead after the 1

27 Practical Consideration (Cont’d) Multicharacter Lookahead (Cont’d) –It is easy to build a scanner that can perform general backup. –If we reach a situation in which we are not in final state and cannot scan any more characters, backup is invoked. Until we reach a prefix of the scanned characters flagged as a valid token

28

29 Translating Regular Expressions into Finite Automata Regular expressions are equivalent to FAs The main job of a scanner generator –To transform a regular expression definition into an equivalent FA A regular expression Nondeterministic FA Deterministic FA Optimized Deterministic FA minimize # of states

30 R Scanner Generator P P A string S Yes No S  L( R ) S  L( R ) Given a regular expression Generated program

31 Translating Regular Expressions into Finite Automata A FA is nondeterministic:

32 Translating Regular Expressions into Finite Automata We can transform any regular expression into an NFA with the following properties: –There is an unique final state –The final state has no successors –Every other state has either one or two successors

33 Translating Regular Expressions into Finite Automata We need to review the definition of regular expression 1. (null string) 2. a (a char of the vocabulary) 3. A|B (or) 4. AB (cancatenation) 5. A* (repetition)

34

35 This figure is wrong. The final state has no successor

36 Construct an NFA for Regular Expression 01 * |1 01 * |1  (0(1 * ))|1  start 1*1*  01 *      start 01 * |1 

37 Creating Deterministic Automata The transformation from an NFA N to an equivalent DFA M works by what is sometimes called the subset construction –Step 1: The initial state of M is the set of states reachable from the initial state of N by -transitions

38 Creating Deterministic Automata –Step 2: To create the successor states Take any state S of M and any character c, and compute S’s successor under c –S is identified with some set of N’s states, {n 1, n 2,…} –Find all possible successor states to {n 1, n 2,…} under c »Obtain a set {m 1, m 2,…} –T=close({m 1, m 2,…}) ST {n 1, n 2,…}close({m 1, m 2,…})

39 Creating Deterministic Automata

40

41 Optimizing Finite Automata minimize number of states –Every DFA has a unique smallest equivalent DFA –Given a DFA M, we use splitting to construct the equivalent minimal DFA. –Initially, there are two sets, one consisting all accepting states of M, the other the remaining states. (S 1, S 2, , S i, , S j,  S k, ,S l,  ) (P 1, P 2, , P n )(N 1, N 2, , N m ) cccccc

42 (S 1, S 2, , S i, , S j,  S k, ,S l, ,S x,,S y, ) (P 1, P 2, , P n )(N 1, N 2, , N m ) cccccc (S 1, S 2, S j,  )(S i, S k, S l,  )(S x, S y,  ) Note that S x and S y no transaction on c

43

44 Initially, two sets {1, 2, 3, 5, 6}, {4, 7}. {1, 2, 3, 5, 6} splits {1, 2, 5}, {3, 6} on c. {1, 2, 5} splits {1}, {2, 5} on b.