Chapter 3. Lexical Analysis (1)

2 Interaction of lexical analyzer with parser.
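The figure depicts a call-and-return protocol: the parser issues a "get next token" request and the lexical analyzer returns the next token from the input. A minimal C sketch of that loop (Token, get_next_token, and parse are illustrative names, not from the slides):

    #include <stdio.h>

    /* Illustrative token classes; a real compiler defines many more. */
    typedef enum { TOK_ID, TOK_NUM, TOK_RELOP, TOK_EOF } TokenKind;

    typedef struct {
        TokenKind kind;   /* which class of token */
        int       attr;   /* attribute value, e.g. a symbol-table index */
    } Token;

    /* The lexical analyzer: on each call, reads characters from the
       source and returns the next token.  Stubbed here. */
    Token get_next_token(void) {
        Token t = { TOK_EOF, 0 };
        return t;
    }

    /* The parser drives the scanner: "get next token" is issued
       repeatedly until end of input. */
    void parse(void) {
        Token t = get_next_token();
        while (t.kind != TOK_EOF) {
            /* ... use t in grammar analysis ... */
            t = get_next_token();
        }
    }

    int main(void) { parse(); return 0; }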

3 Lexical Analysis
  Issues
  – Simpler design is preferred
  – Compiler efficiency is improved
  – Compiler portability is improved
  Terms
  – Token: a terminal symbol in the grammar
  – Pattern: a rule describing the set of strings that can form a particular token
  – Lexeme: a sequence of source characters matched by the pattern for a token

4 Examples of tokens.

  TOKEN      SAMPLE LEXEMES         INFORMAL DESCRIPTION OF PATTERN
  const      const                  const
  if         if                     if
  relation   <, <=, =, <>, >, >=    < or <= or = or <> or > or >=
  id         pi, count, D           letter followed by letters and digits
  num        0, 6.02E23             any numeric constant
  literal    "core dumped"          any characters between " and " except "

5 Difficulties in implementing lexical analyzers
  FORTRAN
  – No delimiters are used and blanks are not significant
  – DO 5 I = 1.25 is an assignment (to the variable DO5I), whereas DO 5 I = 1,25 begins a DO loop; the scanner cannot tell which until it reaches the . or the ,
  PL/I
  – Keywords are not reserved
  – IF THEN THEN THEN = ELSE; ELSE ELSE = THEN; is a legal statement sequence

6 Attributes for tokens
  A lexical analyzer collects information about tokens into their associated attributes (e.g., a pointer to the symbol-table entry for an identifier)
  Example
  – E = M * C ** 2
  – the constant 2 is generally stored in a constant table, and the num token's attribute points to that entry
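For E = M * C ** 2 the scanner hands the parser a stream of (token, attribute) pairs. A minimal C sketch of one possible representation (the type names and table indices are made up for illustration):

    #include <stdio.h>

    /* Illustrative token classes for this one statement. */
    typedef enum { ID, NUM, ASSIGN_OP, MULT_OP, EXP_OP } TokenKind;

    /* A token is a class plus an attribute: identifiers carry a
       symbol-table index, numbers a constant-table index, operators none. */
    typedef struct {
        TokenKind kind;
        int       attr;   /* table entry, or 0 when unused */
    } Token;

    int main(void) {
        /* E = M * C ** 2 as a token stream, assuming E, M, C occupy
           symbol-table slots 1..3 and the constant 2 occupies slot 1
           of the constant table. */
        Token stream[] = {
            { ID, 1 }, { ASSIGN_OP, 0 }, { ID, 2 }, { MULT_OP, 0 },
            { ID, 3 }, { EXP_OP, 0 },    { NUM, 1 }
        };
        int n = sizeof stream / sizeof stream[0];
        for (int i = 0; i < n; i++)
            printf("<%d, %d>\n", stream[i].kind, stream[i].attr);
        return 0;
    }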

7 Lexical Errors
  Possible error-recovery actions
  – Deleting an extraneous character
  – Inserting a missing character
  – Replacing an incorrect character by a correct character
  – Transposing two adjacent characters
  Minimum-distance error correction: repair the input with the fewest such changes
  Examples
  – Detectable: 2as3, 2#31, ...
  – Undetectable: fi(a == f(x)) ... (fi is itself a valid identifier, so the scanner cannot tell that if was intended)

8 Input Buffering
  A single buffer causes difficulty
  – a word may straddle the boundary between two buffers
  – Declare (arg1, ..., argn): whether Declare is an array or a function cannot be decided until later input is seen
  Buffer pairs
  – A good solution: two buffer halves refilled alternately
  – With sentinels there is no need to test for end of buffer and end of file separately on every character read

9 Sentinels at end of each buffer half.
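A minimal C sketch of the scheme in the figure, assuming two halves of one array, '\0' as the sentinel byte (so it must not occur in the source), and illustrative helper names:

    #include <stdio.h>

    #define N 4096                    /* size of each buffer half */
    #define SENTINEL '\0'             /* assumed not to occur in the source */

    static char  buf[2 * N + 2];      /* two halves, each followed by a sentinel */
    static char *forward = buf;

    /* Refill one half from the source and drop a sentinel after the
       last character actually read. */
    static void load_first_half(FILE *f)  { size_t n = fread(buf, 1, N, f);         buf[n] = SENTINEL; }
    static void load_second_half(FILE *f) { size_t n = fread(buf + N + 1, 1, N, f); buf[N + 1 + n] = SENTINEL; }

    /* Advance the forward pointer by one character.  Because every half
       ends in a sentinel, each advance makes ONE test; the two-way check
       (end of half? end of file?) happens only when the sentinel is seen. */
    static int advance(FILE *f) {
        for (;;) {
            char c = *forward++;
            if (c != SENTINEL)
                return (unsigned char)c;
            if (forward == buf + N + 1) {            /* ran off the first half */
                load_second_half(f);
                forward = buf + N + 1;
            } else if (forward == buf + 2 * N + 2) { /* ran off the second half */
                load_first_half(f);
                forward = buf;
            } else {
                return EOF;                          /* sentinel inside a half: real end of input */
            }
        }
    }

    int main(void) {                                 /* demo: copy stdin to stdout */
        load_first_half(stdin);
        for (int c; (c = advance(stdin)) != EOF; )
            putchar(c);
        return 0;
    }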

10 Specification of Tokens
  Strings and languages
  – Alphabet (or character class): a finite set of symbols
  – String = sentence = word
  – |s|: the length of a string s
  – ε: the empty string; ∅: the empty set (note that {ε} ≠ ∅)
  – If x and y are strings, xy is their concatenation, and εx = xε = x
  Operations on languages

11 Terms for parts of a string.
  – prefix of s: a string obtained by removing zero or more trailing symbols of string s; e.g., ban is a prefix of banana.
  – suffix of s: a string formed by deleting zero or more of the leading symbols of s; e.g., nana is a suffix of banana.
  – substring of s: a string obtained by deleting a prefix and a suffix from s; e.g., nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and ε are prefixes, suffixes, and substrings of s.
  – proper prefix, suffix, or substring of s: any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that x ≠ s.
  – subsequence of s: any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., baaa is a subsequence of banana.

12 Definitions of operations on languages.
  – union of L and M, written L ∪ M: L ∪ M = { s | s is in L or s is in M }
  – concatenation of L and M, written LM: LM = { st | s is in L and t is in M }
  – Kleene closure of L, written L*: "zero or more concatenations of" L
  – positive closure of L, written L+: "one or more concatenations of" L
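For instance, if L = {a, b} and M = {c, ε}, then L ∪ M = {a, b, c, ε}, LM = {ac, bc, a, b}, L* = {ε, a, b, aa, ab, ba, bb, ...}, and L+ = LL* = L* − {ε} (the last equality holds here because ε is not in L).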

13 Regular Expressions
  1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
  2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string, or symbol.
  3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
     a) (r)|(s) is a regular expression denoting L(r) ∪ L(s).
     b) (r)(s) is a regular expression denoting L(r)L(s).
     c) (r)* is a regular expression denoting (L(r))*.
     d) (r) is a regular expression denoting L(r).

14 Examples of operations in regular expressions
  Σ = {a, b} is the alphabet
  – a | b denotes {a, b}
  – (a|b)(a|b) denotes {aa, ab, ba, bb}
  – a* denotes {ε, a, aa, aaa, ...}
  – (a|b)* = (a*|b*)*
  – aa* = a+,  ε | a+ = a*
  – (a|b) = (b|a)

15 Algebraic properties of regular expressions.

  AXIOM                              DESCRIPTION
  r|s = s|r                          | is commutative
  r|(s|t) = (r|s)|t                  | is associative
  (rs)t = r(st)                      concatenation is associative
  r(s|t) = rs|rt,  (s|t)r = sr|tr    concatenation distributes over |
  εr = r,  rε = r                    ε is the identity element for concatenation
  r* = (r|ε)*                        relation between * and ε
  r** = r*                           * is idempotent

16 Regular Definitions
  A regular definition is a sequence of definitions of the form
    d1 → r1
    d2 → r2
    ...
    dn → rn
  Example
    letter → A | B | ... | Z | a | b | ... | z
    digit  → 0 | 1 | ... | 9
    id     → letter (letter | digit)*
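A minimal C sketch of a checker for the id definition above, assuming ASCII letters and digits (is_id is an illustrative name; a real scanner works incrementally over the input buffer rather than on a whole string):

    #include <ctype.h>
    #include <stdio.h>

    /* Returns 1 if s matches id -> letter (letter | digit)*, else 0. */
    static int is_id(const char *s) {
        if (!isalpha((unsigned char)s[0]))
            return 0;                              /* must begin with a letter */
        for (int i = 1; s[i] != '\0'; i++)
            if (!isalnum((unsigned char)s[i]))
                return 0;                          /* remainder: letters or digits only */
        return 1;
    }

    int main(void) {
        printf("%d %d %d\n", is_id("count"), is_id("x2"), is_id("2x"));  /* prints 1 1 0 */
        return 0;
    }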

17 Unsigned numbers (Pascal)
    digit             → 0 | 1 | ... | 9
    digits            → digit digit*
    optional_fraction → . digits | ε
    optional_exponent → (E (+ | - | ε) digits) | ε
    num               → digits optional_fraction optional_exponent

18 Notational Shorthands (1/2)
  1. One or more instances. The unary postfix operator + means "one or more instances of." If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. Thus, the regular expression a+ denotes the set of all strings of one or more a's. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+ | ε and r+ = rr* relate the Kleene and positive closure operators.
  2. Zero or one instance. The unary postfix operator ? means "zero or one instance of." The notation r? is a shorthand for r | ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}. For example, using the + and ? operators, we can rewrite the regular definition for num in Example 3.5 as shown on the next slide.

19 Notational Shorthands (2/2)
  3. Character classes. The notation [abc], where a, b, and c are alphabet symbols, denotes the regular expression a | b | c. An abbreviated character class such as [a-z] denotes the regular expression a | b | ··· | z. Using character classes, we can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*.
  With these shorthands the definition of num becomes
    digit             → 0 | 1 | ··· | 9
    digits            → digit+
    optional_fraction → (. digits)?
    optional_exponent → (E (+ | -)? digits)?
    num               → digits optional_fraction optional_exponent
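The rewritten num definition maps almost directly onto a POSIX extended regular expression; a small sketch using the regcomp/regexec interface (the ^ and $ anchors and the test strings are choices made here, not part of the slide's notation):

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        /* num -> digits optional_fraction optional_exponent, as an ERE */
        const char *pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
        regex_t re;
        if (regcomp(&re, pattern, REG_EXTENDED) != 0)
            return 1;

        const char *samples[] = { "0", "6.02E23", "1.", "E23" };
        for (int i = 0; i < 4; i++)
            printf("%-8s %s\n", samples[i],
                   regexec(&re, samples[i], 0, NULL, 0) == 0 ? "matches" : "no match");

        regfree(&re);
        return 0;    /* 0 and 6.02E23 match; 1. and E23 do not */
    }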

20 Nonregular sets
  {wcw⁻¹ | w is a string of a's and b's}, where w⁻¹ denotes w reversed
  – cannot be described by a regular expression; a context-free grammar is required to represent such strings
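For instance, reading w⁻¹ as w reversed, the context-free grammar S → a S a | b S b | c generates exactly the strings of this set, which no regular expression can describe.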

21 Regular-expression patterns for tokens.

  REGULAR EXPRESSION   TOKEN   ATTRIBUTE-VALUE
  ws                   -       -
  if                   if      -
  then                 then    -
  else                 else    -
  id                   id      pointer to table entry
  num                  num     pointer to table entry
  <                    relop   LT
  <=                   relop   LE
  =                    relop   EQ
  <>                   relop   NE
  >                    relop   GT
  >=                   relop   GE

22 Transition diagrams
  Finite-state automata: states and edges
  – several examples are shown on the following slides
  – Fig. 3.14 is drawn from the preceding examples

23 Transition diagram for identifiers and keywords.
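A small C sketch of this diagram: stay in the letter-or-digit state until a delimiter is seen, retract, then consult a keyword table to decide between a keyword and an ordinary identifier (the keyword list and the name gettoken are illustrative, not from the slides):

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative keyword table; many compilers pre-install keywords
       in the symbol table instead. */
    static const char *keywords[] = { "if", "then", "else", "begin", "end", "do" };
    enum { NKEYWORDS = sizeof keywords / sizeof keywords[0] };

    /* Simulate the diagram for letter(letter|digit)*.  Returns 0 if the
       input at *pos does not start an identifier, 1 for an ordinary
       identifier, 2 for a keyword; the lexeme is copied into lexeme[]. */
    static int gettoken(const char *input, int *pos, char lexeme[], int cap) {
        int i = *pos, n = 0;
        if (!isalpha((unsigned char)input[i]))
            return 0;                                    /* start state: need a letter */
        while (isalnum((unsigned char)input[i]) && n < cap - 1)
            lexeme[n++] = input[i++];                    /* accepting loop on letter|digit */
        lexeme[n] = '\0';
        *pos = i;                                        /* retract: delimiter not consumed */
        for (int k = 0; k < NKEYWORDS; k++)
            if (strcmp(lexeme, keywords[k]) == 0)
                return 2;
        return 1;
    }

    int main(void) {
        const char *src = "if count42 then x";
        char lex[64];
        int pos = 0;
        while (src[pos] != '\0') {
            if (!isalpha((unsigned char)src[pos])) { pos++; continue; }  /* skip delimiters */
            int kind = gettoken(src, &pos, lex, sizeof lex);
            printf("%s -> %s\n", lex, kind == 2 ? "keyword" : "id");
        }
        return 0;
    }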

24 Implementation with Lex
  Regular definitions → finite automata / transition diagrams → emitted as a C program
  Used for lexical analysis, pattern matching, ...

25 Creating a lexical analyzer with Lex.

26 Lex program for the tokens of the figure above (1/2)

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

27 Lex program for the tokens of the figure above (2/2)

%%
{ws}       { /* no action and no return */ }
if         { return(IF); }
then       { return(THEN); }
else       { return(ELSE); }
{id}       { yylval = install_id();  return(ID); }
{number}   { yylval = install_num(); return(NUMBER); }
"<"        { yylval = LT; return(RELOP); }
"<="       { yylval = LE; return(RELOP); }
"="        { yylval = EQ; return(RELOP); }
"<>"       { yylval = NE; return(RELOP); }
">"        { yylval = GT; return(RELOP); }
">="       { yylval = GE; return(RELOP); }
%%
install_id() {
    /* procedure to install the lexeme, whose first character is
       pointed to by yytext and whose length is yyleng, into the
       symbol table and return a pointer thereto */
}

install_num() {
    /* similar procedure to install a lexeme that is a number */
}
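On a typical Unix system this specification would be processed by lex (or flex) to produce the C scanner lex.yy.c, which is then compiled and linked with the Lex library to obtain the analyzer sketched in the "Creating a lexical analyzer with Lex" figure above.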

28 Lookahead operator
  FORTRAN cases where the scanner must look ahead before it can classify a lexeme:
  – DO 5 I = 1.25 (an assignment) vs. DO 5 I = 1,25 (a DO statement)
    DO/({letter}|{digit})*=({letter}|{digit})*,
    DO/{id}*={digit}*,
  – IF(I,J) = 3 (an assignment to an element of array IF) vs. IF(condition) statement
    IF/\(.*\){letter}