어휘분석 (Lexical Analysis). Overview Main task: to read input characters and group them into “ tokens. ” Secondary tasks: –Skip comments and whitespace;

Slides:



Advertisements
Similar presentations
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
Advertisements

From Cooper & Torczon1 The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source language?
Lexical Analysis — Introduction Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University.
Regular Expressions Finite State Automaton. Programming Languages2 Regular expressions  Terminology on Formal languages: –alphabet : a finite set of.
1 CIS 461 Compiler Design and Construction Fall 2012 slides derived from Tevfik Bultan et al. Lecture-Module 5 More Lexical Analysis.
CSc 453 Lexical Analysis (Scanning)
LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006.
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
1 The scanning process Main goal: recognize words/tokens Snapshot: At any point in time, the scanner has read some input and is on the way to identifying.
1 The scanning process Goal: automate the process Idea: –Start with an RE –Build a DFA How? –We can build a non-deterministic finite automaton (Thompson's.
2. Lexical Analysis Prof. O. Nierstrasz
Compiler Construction
Lexical Analysis — Part II From Regular Expression to Scanner Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled.
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
Scanner Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source language? Is the.
1 Scanning Aaron Bloomfield CS 415 Fall Parsing & Scanning In real compilers the recognizer is split into two phases –Scanner: translate input.
Topic #3: Lexical Analysis
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions.
1 Outline Informal sketch of lexical analysis –Identifies tokens in input string Issues in lexical analysis –Lookahead –Ambiguities Specifying lexers –Regular.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Lexical Analysis Constructing a Scanner from Regular Expressions.
Topic #3: Lexical Analysis EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003.
Lexical Analyzer (Checker)
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
1 November 1, November 1, 2015November 1, 2015November 1, 2015 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
What is a language? An alphabet is a well defined set of characters. The character ∑ is typically used to represent an alphabet. A string : a finite.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
CSc 453 Lexical Analysis (Scanning)
Lexical Analysis: DFA Minimization & Wrap Up. Automating Scanner Construction PREVIOUSLY RE  NFA ( Thompson’s construction ) Build an NFA for each term.
Exercise 1 Consider a language with the following tokens and token classes: ID ::= letter (letter|digit)* LT ::= " " shiftL ::= " >" dot ::= "." LP ::=
Lexical Analysis (Scanning) Lexical Analysis (Scanning)
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
1st Phase Lexical Analysis
Exercise Solution for Exercise (a) {1,2} {3,4} a b {6} a {5,6,1} {6,2} {4} {3} {5,6} { } b a b a a b b a a b a,b b b a.
using Deterministic Finite Automata & Nondeterministic Finite Automata
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
CMPSC 160 Translation of Programming Languages Fall 2002 Instructor: Hugh McGuire Lecture 2 Phases of a Compiler Lexical Analysis.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
1 Compiler Construction Vana Doufexi office CS dept.
Deterministic Finite Automata Nondeterministic Finite Automata.
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Compilers Lexical Analysis 1. while (y < z) { int x = a + b; y += x; } 2.
Department of Software & Media Technology
COMP 3438 – Part II - Lecture 3 Lexical Analysis II Par III: Finite Automata Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Topic 3: Automata Theory 1. OutlineOutline Finite state machine, Regular expressions, DFA, NDFA, and their equivalence, Grammars and Chomsky hierarchy.
CS314 – Section 5 Recitation 2
Lecture 2 Lexical Analysis
Chapter 3 Lexical Analysis.
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
CSc 453 Lexical Analysis (Scanning)
CSc 453 Lexical Analysis (Scanning)
Two issues in lexical analysis
Recognizer for a Language
Lexical Analysis - An Introduction
Review: Compiler Phases:
Compiler Construction
Lexical Analysis - An Introduction
COMPILERS LECTURE(6-Aug-13)
Compiler Construction
CSc 453 Lexical Analysis (Scanning)
Presentation transcript:

어휘분석 (Lexical Analysis)

Overview Main task: to read input characters and group them into “ tokens. ” Secondary tasks: –Skip comments and whitespace; –Correlate error messages with source program (e.g., line number of error).

Overview (cont ’ d)

Lexical Analysis: Terminology token: a name for a set of input strings with related structure. Example: “ identifier, ” “ integer constant ” pattern: a rule describing the set of strings associated with a token. Example: “ a letter followed by zero or more letters, digits, or underscores. ” lexeme: the actual input string that matches a pattern. Example: count

Examples Input: count = 123 Tokens: identifier : Rule: “ letter followed by …” Lexeme: count assg_op : Rule: = Lexeme: = integer_const : Rule: “ digit followed by …” Lexeme: 123

Why token? Language syntax is specified with token types, not words (lexeme) 1. goal  expr 2. expr  expr op term 3. | term 4. term  number 5. | id 6. op  + 7. | – S = goal T = { number, id, +, - } N = { goal, expr, term, op } P = { 1, 2, 3, 4, 5, 6, 7}

Specifying Tokens: regular expressions Terminology: –alphabet : a finite set of symbols –string : a finite sequence of alphabet symbols –language : a (finite or infinite) set of strings. Regular Operations on languages: Union: R  S = { x | x  R or x  S} Concatenation: RS = { xy | x  R and y  S} Kleene closure: R* = R concatenated with itself 0 or more times = {  }  R  RR  RRR  = strings obtained by concatenating a finite number of strings from the set R.

Regular Expressions A pattern notation for describing certain kinds of sets over strings: Given an alphabet  : –  is a regular exp. (denotes the language {  }) –for each a  , a is a regular exp. (denotes the language {a}) –if r and s are regular exps. denoting L(r) and L(s) respectively, then so are: (r) | (s) ( denotes the language L(r)  L(s) ) (r)(s) ( denotes the language L(r)L(s) ) (r)* ( denotes the language L(r)* )

Common Extensions to r.e. Notation One or more repetitions of r : r+ A range of characters : [a-zA-Z], [0-9] An optional expression: r? Any single character:. Giving names to regular expressions, e.g.: –letter = [a-zA-Z_] –digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 –ident = letter ( letter | digit )* –Integer_const = digit+

Examples of Regular Expressions Identifiers: Letter  (a|b|c| … |z|A|B|C| … |Z) Digit  (0|1|2| … |9) Identifier  Letter ( Letter | Digit ) * Numbers: [0-9][0-9]+[0-9]* [1-9][0-9]* ([1-9][0-9]*)|0 -?[0-9]+ [0-9]*\.[0-9]+([0-9]+)|([0-9]*\.[0-9]+) [eE][-+]?[0-9]+([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? -?( ([0-9]+) | ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? )

Examples of Regular Expressions Numbers: Integer  (+|-|  ) (0| (1|2|3| … |9)(Digit * ) ) Decimal  Integer. Digit * Real  ( Integer | Decimal ) E (+|-|  ) Digit * Complex  ( Real, Real )

Exercise of Regular Expressions 문자열 [a-z][a-zA-Z][a-zA-Z0-9] [a-zA-Z][a-zA-Z0-9]* 스트링 –"this is a string" \".*\"<- wrong!!! why? \"[^"]*\" 몇가지 연습 0 과 1 로 이루어진 문자열 중에서... –0 으로 시작하는 문자열 0[01]* –0 으로 시작해서 0 으로 끝나는 문자열 0[01]*0 –0 과 1 이 번갈아 나오는 문자열 ______ –0 이 두번 계속 나오지 않는 문자열 ______

Recognizing Tokens: Finite Automata A finite automaton is a 5-tuple (Q, , T, q0, F), where: –  is a finite alphabet; –Q is a finite set of states; –T: Q    Q is the transition function; –q0  Q is the initial state; and –F  Q is a set of final states.

Finite Automata: An Example A (deterministic) finite automaton (DFA) to match C-style comments:

Consider the problem of recognizing register names Register  r (0|1|2| … | 9) (0|1|2| … | 9) * Allows registers of arbitrary number Requires at least one digit RE corresponds to a recognizer (or DFA ) Transitions on other inputs go to an error state, s e Example 2 S0S0 S2S2 S1S1 r (0|1|2| … 9) accepting state (0|1|2| … 9) Recognizer for Register

DFA operation Start in state S 0 & take transitions on each input character DFA accepts a word x iff x leaves it in a final state (S 2 ) So, r17 takes it through s 0, s 1, s 2 and accepts r takes it through s 0, s 1 and fails a takes it straight to s e Example 2 ( continued ) S0S0 S2S2 S1S1 r (0|1|2| … 9) accepting state (0|1|2| … 9) Recognizer for Register

Example 2 (continued) To be useful, recognizer must turn into code sese sese sese sese sese s2s2 sese s2s2 sese s2s2 sese s1s1 sese sese s1s1 s0s0 All others 0,1,2,3,4,5, 6,7,8,9r  Char  next character State  s 0 while (Char  EOF) State   (State,Char) Char  next character if (State is a final state ) then report success else report failure Skeleton recognizer Table encoding RE

r Digit Digit * allows arbitrary numbers Accepts r00000 Accepts r99999 What if we want to limit it to r0 through r31 ? Write a tighter regular expression –Register  r ( (0|1|2) (Digit |  ) | (4|5|6|7|8|9) | (3|30|31) ) –Register  r0|r1|r2| … |r31|r00|r01|r02| … |r09 Produces a more complex DFA Has more states Same cost per transition Same basic implementation What if we need a tighter specification?

Tighter register specification (continued) The DFA for Register  r ( (0|1|2) (Digit |  ) | (4|5|6|7|8|9) | (3|30|31) ) Accepts a more constrained set of registers Same set of actions, more states S0S0 S5S5 S1S1 r S4S4 S3S3 S6S6 S2S2 0,1,20,1,2 3 0,10,1 4,5,6,7,8,94,5,6,7,8,9 (0|1|2| … 9)

sese sese sese sese sese S1S1 s0s0 sese sese sese sese sese sese sese sese sese sese sese sese sese s6s6 sese sese sese sese s6s6 sese s5s5 sese sese sese sese sese sese s4s4 sese sese sese sese sese sese s3s3 sese s3s3 s3s3 s3s3 s3s3 sese s2s2 sese s4s4 s5s5 s2s2 s2s2 sese s1s1 All others4-9320,1r  Table encoding RE for the tighter register specification Tighter register specification (continued)

Automating Scanner Construction RE→ NFA (Thompson’s construction) –Build an NFA for each term –Combine them with ε-moves NFA → DFA (subset construction) –Build the simulation DFA → Minimal DFA –Hopcroft’s algorithm DFA →RE (Not part of the scanner construction) –All pairs, all paths problem –Take the union of all paths from s0 to an accepting state