ICS611 Lex Set 3: Lex and Yacc


ICS611 Lex Set 3

Lex and Yacc
Lex is a program that generates lexical analyzers. Converting the source code into symbols (tokens) is the work of the C program produced by Lex. This program serves as a subroutine of the C program that YACC produces for the parser.

Lexical Analysis
LEX takes as input a description of the tokens that can occur in the language. This description is given by means of regular expressions, as defined on the next slide. Regular expressions define patterns of characters.

Basics of Regular Expressions
1. Any character (or string of characters) matches itself, except for those characters (called metacharacters) that have a special interpretation, such as ( ) [ ] { } + * ? | etc. For instance, the string "if" in a regular expression will match the identical string in the source code.

2. The period symbol "." matches any single character in the source code except the newline character "\n".

3. Square brackets define a character class. Either a sequence of symbols or a range denoted with a hyphen can be used, e.g. [01a-z]. A character class matches a single symbol in the source code that is a member of the class. For instance, [01a-z] matches the character 0 or 1 or any lowercase alphabetic character.

4. The "+" symbol following a regular expression denotes 1 or more occurrences of that expression. For instance, [0-9]+ matches any non-empty sequence of digits in the source code.

Similarly:
5. A "*" following a regular expression denotes 0 or more occurrences of that expression.
6. A "?" following a regular expression denotes 0 or 1 occurrence of that expression.

7. The symbol "|" is used as an OR operator to identify alternate choices. For instance, [a-z]+|9 matches either a string of one or more lowercase alphabetic characters or the digit "9".

8. Parentheses can be freely used for grouping. For example, (a|b)+ matches e.g. abba, while a|b+ matches either a single a or a string of one or more b's.

9. Regular expressions can be concatenated. For instance, [a-zA-Z]*[0-9]+[a-zA-Z] matches any sequence of 0 or more letters, followed by 1 or more digits, followed by exactly 1 letter.

As has been shown, symbols such as + * ? . ( ) [ ] have special meanings in regular expressions.
10. If you want to include one of these symbols in a regular expression simply as a character, you can either precede it with the C escape symbol "\" or enclose it in double quotes. For example, [0-9]"+"[0-9] or [0-9]\+[0-9] matches a digit, followed by a plus sign, followed by a digit.
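Rules 1 to 10 can be tried out quickly outside of Lex. The following is a minimal sketch that uses Python's re module as a stand-in, on the assumption that it agrees with Lex for the constructs shown here; Lex's double-quote quoting from rule 10 has no Python equivalent, so only the backslash form is checked. The helper name full_match and the sample strings are illustrative and not part of the slides.

```python
import re


def full_match(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` is matched by `pattern`."""
    return re.fullmatch(pattern, text) is not None


# (pattern, input, expected result); the comments give the rule being exercised.
checks = [
    ("if", "if", True),                         # rule 1: ordinary characters match themselves
    (".", "x", True),                           # rule 2: "." matches any single character
    (".", "\n", False),                         # rule 2: but not the newline
    ("[01a-z]", "q", True),                     # rule 3: member of the character class
    ("[01a-z]", "7", False),                    # rule 3: not a member
    ("[0-9]+", "2024", True),                   # rule 4: one or more digits
    ("[0-9]+", "", False),                      # rule 4: "+" needs at least one occurrence
    ("[0-9]*", "", True),                       # rule 5: "*" allows zero occurrences
    ("ab?", "a", True),                         # rule 6: the b is optional
    ("[a-z]+|9", "abc", True),                  # rule 7: first alternative
    ("[a-z]+|9", "9", True),                    # rule 7: second alternative
    ("[a-z]+|9", "abc9", False),                # rule 7: not letters followed by 9
    ("(a|b)+", "abba", True),                   # rule 8: the group is repeated
    ("a|b+", "abba", False),                    # rule 8: only the b is repeated
    ("a|b+", "bbb", True),
    ("[a-zA-Z]*[0-9]+[a-zA-Z]", "x12y", True),  # rule 9: concatenation
    (r"[0-9]\+[0-9]", "1+2", True),             # rule 10: escaped metacharacter
]

results = [full_match(p, t) == expected for p, t, expected in checks]
print(all(results))
```

Each entry pairs a pattern with an input and the result the rules predict, so a failing entry would point at a misunderstanding of one specific rule.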

Examples
Given: R = ( abb | cd ) and S = abc
RS = ( abbabc | cdabc ) is a regular expression.
SR = ( abcabb | abccd ) is a regular expression.
The following strings are matched by R*:
abbcdcdcdcd
ε (the empty string)
cdabbcdabbabbcd
abb
cd
cdcdcdcdcdcdcd
and so forth.
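The concatenations and the closure in this example can be checked mechanically. This is a small sketch using Python's re module as a stand-in for Lex; the variable names (RS, SR, R_star, listed) are chosen here for illustration.

```python
import re

R = "(abb|cd)"
S = "abc"

RS = R + S        # concatenation: an R string followed by an S string
SR = S + R        # concatenation in the other order
R_star = R + "*"  # closure: zero or more R strings in a row

# Strings listed on the slide as matched by R*.
listed = ["abbcdcdcdcd", "cdabbcdabbabbcd", "abb", "cd", "cdcdcdcdcdcdcd"]

rs_ok = all(re.fullmatch(RS, s) for s in ["abbabc", "cdabc"])
sr_ok = all(re.fullmatch(SR, s) for s in ["abcabb", "abccd"])
star_ok = all(re.fullmatch(R_star, s) for s in listed)
empty_ok = re.fullmatch(R_star, "") is not None  # zero repetitions
print(rs_ok, sr_ok, star_ok, empty_ok)
```

A string such as abbc is rejected by R*, because the trailing c cannot be completed to either abb or cd.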

What kinds of strings can be matched by the regular expression ( a | c )* b ( a | c )* ?
( a | c )* is a regular expression that can match the empty string ε, or any string containing only a's and c's.
b is a regular expression that can match a single occurrence of the symbol "b".
( a | c )* is the same as the first regular expression.
So the entire expression ( a | c )* b ( a | c )* can match any string made up of a possibly empty string of a's and c's, followed by a single b, followed by a possibly empty string of a's and c's. In other words, the regular expression can match any string over the alphabet {a,b,c} that contains exactly one b.

What kinds of strings can be matched by the regular expression ( a | c )* ( b | ε ) ( a | c )* ?
This is the same as the previous example, except that the regular expression in the center is now ( b | ε ).
( b | ε ) can match either an occurrence of a single b, or the empty string, which contains no characters.
So the entire expression ( a | c )* ( b | ε ) ( a | c )* can match any string over the alphabet {a,b,c} that contains either 0 or 1 b's.
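Both claims ("exactly one b" and "0 or 1 b's") can be checked by brute force over all short strings. The sketch below assumes Python's re module as a stand-in for Lex and writes the empty-string alternative as an empty branch, (b|), which could equally be written b?. The names strings_up_to and agrees and the length bound of 6 are arbitrary choices made for this illustration.

```python
import re
from itertools import product


def strings_up_to(alphabet: str, max_len: int):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


exactly_one_b = re.compile(r"(a|c)*b(a|c)*")
at_most_one_b = re.compile(r"(a|c)*(b|)(a|c)*")  # (b|) plays the role of (b | empty string)


def agrees(max_len: int = 6) -> bool:
    """Compare each regular expression with a direct count of the b's."""
    for s in strings_up_to("abc", max_len):
        if (exactly_one_b.fullmatch(s) is not None) != (s.count("b") == 1):
            return False
        if (at_most_one_b.fullmatch(s) is not None) != (s.count("b") <= 1):
            return False
    return True


print(agrees())
```

This only tests strings up to the chosen length, so it is a sanity check on the reading of the expressions rather than a proof.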

Precedence of Operations in Regular Expressions
From highest to lowest:
Closure (*)
Concatenation
Alternation (OR)
Examples:
a | bcf means the symbol a OR the string bcf.
a( bcf* ) is the string abc followed by 0 or more repetitions of the symbol f. Note: this is the same as (abcf*).
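The two examples show that closure binds more tightly than concatenation, which in turn binds more tightly than alternation. A quick way to see this is to test strings that would only match under a different grouping. The sketch below uses Python's re module, assumed here to follow the same precedence as Lex for these operators; the dictionary name observations is illustrative.

```python
import re


def full_match(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` is matched by `pattern`."""
    return re.fullmatch(pattern, text) is not None


observations = {
    # Alternation binds loosest: a|bcf is read as a | (bcf), not (a|b)cf.
    "a|bcf ~ a": full_match("a|bcf", "a"),
    "a|bcf ~ bcf": full_match("a|bcf", "bcf"),
    "a|bcf ~ acf": full_match("a|bcf", "acf"),
    # Closure binds tightest: abcf* is read as abc(f*), not (abcf)*.
    "abcf* ~ abc": full_match("abcf*", "abc"),
    "abcf* ~ abcfff": full_match("abcf*", "abcfff"),
    "abcf* ~ abcfabcf": full_match("abcf*", "abcfabcf"),
    # Redundant parentheses do not change the meaning.
    "a(bcf*) ~ abcfff": full_match("a(bcf*)", "abcfff"),
}

for label, matched in observations.items():
    print(label, matched)
```

The entries for acf and abcfabcf come out False, which is what rules out the alternative groupings.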

GRAMMARS vs REGULAR EXPRESSIONS
Consider the set of strings (i.e. language) {a^n b a^n | n > 0}. A grammar that generates this language is:
S -> a S a
S -> a b a
However, as we will show later, it is not possible to construct a regular expression that matches exactly this language.
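The contrast can be made concrete with a short sketch. It assumes the grammar consists of the productions S -> a S a and S -> a b a (one way of generating the language for n > 0); the function names derive and in_language are hypothetical. The recognizer is recursive, mirroring the grammar, while the regular expression a+ba+ is shown only as a natural but too permissive attempt, not as a proof that no regular expression exists.

```python
import re


def derive(n: int) -> str:
    """Apply S -> a S a (n - 1) times, then S -> a b a, giving a^n b a^n."""
    if n < 1:
        raise ValueError("the language requires n > 0")
    return "a" * n + "b" + "a" * n


def in_language(s: str) -> bool:
    """Recursive check that mirrors the two productions."""
    if s == "aba":                                    # S -> a b a
        return True
    if len(s) >= 5 and s[0] == "a" and s[-1] == "a":  # S -> a S a
        return in_language(s[1:-1])
    return False


# A regular expression can require a's on both sides of the b,
# but it cannot force the two counts to be equal.
too_permissive = re.compile(r"a+ba+")

print(in_language(derive(3)))                        # balanced string
print(in_language("aabaaa"))                         # unbalanced string
print(too_permissive.fullmatch("aabaaa") is not None)
```

The recursion keeps the two runs of a's in step by removing one from each end at every level, which is exactly the pairing that the grammar expresses.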

In the Lex definition file one can assign macro names to regular expressions, e.g.:
digit 0|1|2|...|9
assigns the macro name digit to any single decimal digit.
integer {digit}+
assigns the macro name integer to 1 or more repetitions of digit.
NOTE: when using a macro name as part of a regular expression, you need to enclose the name in curly braces {}.
signed_int ("+"|"-")?{integer}
assigns the macro name signed_int to an optional sign followed by an integer. The plus sign is quoted because + is a metacharacter.
number {signed_int}(\.{integer})?(E{signed_int})?
assigns the macro name number to a signed_int followed by an optional fractional part, followed by an optional exponent part.

alpha [a-zA-Z]
assigns the macro name alpha to the character class given by a-z and A-Z.
identifier {alpha}({alpha}|{digit})*
assigns the macro name identifier to an alpha character followed by 0 or more repetitions of either an alpha character or a digit.
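The way macro names are substituted into later definitions can be imitated in a few lines. This is a sketch only, not how Lex itself is implemented: it assumes each {name} is replaced by the parenthesised text of an earlier definition, writes digit as the character class [0-9], and writes the sign as the character class [+-] so that + is taken literally. The function name expand and the variable names are illustrative.

```python
import re

# Definitions in file order; a body may refer to earlier names as {name}.
definitions = [
    ("digit", "[0-9]"),
    ("integer", "{digit}+"),
    ("signed_int", "[+-]?{integer}"),
    ("number", r"{signed_int}(\.{integer})?(E{signed_int})?"),
    ("alpha", "[a-zA-Z]"),
    ("identifier", "{alpha}({alpha}|{digit})*"),
]


def expand(defs):
    """Return a dict mapping each macro name to a plain regular expression."""
    expanded = {}
    for name, body in defs:
        # Replace every {name} with the parenthesised expansion of that name.
        expanded[name] = re.sub(
            r"\{([A-Za-z_][A-Za-z_0-9]*)\}",
            lambda m: "(" + expanded[m.group(1)] + ")",
            body,
        )
    return expanded


macros = expand(definitions)
print(macros["integer"])
print(re.fullmatch(macros["number"], "-3.14E+10") is not None)
print(re.fullmatch(macros["identifier"], "MAX23") is not None)
```

Wrapping each substituted body in parentheses keeps a following operator such as + or * applied to the whole macro rather than to its last character.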

RULE
Given the above examples of patterns defined for LEX, what would be the first token of the following string?
MAX23= Z
Lex picks as the "next" token the longest string that can be matched by one of its regular expressions. In this case, MAX23 would be matched as an identifier, not just M or MA or MAX.
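The longest-match rule can be sketched as a loop that tries every pattern at the current position and keeps the longest result. The token set below is an assumption made for illustration (the assign and space patterns are hypothetical additions), and the sketch relies on Python's greedy repetition returning the longest match for these simple patterns; ties go to the pattern listed first.

```python
import re

# (token name, pattern); earlier entries win ties.
TOKEN_PATTERNS = [
    ("identifier", re.compile(r"[a-zA-Z]([a-zA-Z]|[0-9])*")),
    ("integer", re.compile(r"[0-9]+")),
    ("assign", re.compile(r"=")),      # hypothetical extra token
    ("space", re.compile(r"[ \t]+")),  # hypothetical extra token
]


def next_token(text: str, pos: int = 0):
    """Return (name, lexeme) for the longest match starting at `pos`."""
    best = None
    for name, pattern in TOKEN_PATTERNS:
        m = pattern.match(text, pos)
        # Keep a candidate only if it is strictly longer than the best so far.
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    if best is None:
        raise ValueError(f"no token matches at position {pos}")
    return best[0], best[1]


print(next_token("MAX23= Z"))
```

Calling next_token again with pos set to the end of the previous lexeme would walk through the rest of the input one token at a time.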

An example of a Lex definition file
/* A standalone LEX program that counts identifiers and commas */
/* Definition Section */
%{
int nident = 0; /* # of identifiers in the file being scanned */
int ncomma = 0; /* # of commas in the file */
%}
/* definitions of macro names */
digit [0-9]
alph [a-zA-Z]
%%
 /* Rules Section */
 /* basic patterns to recognize and the code to execute when they occur */
{alph}({alph}|{digit})* {++nident;}
"," {++ncomma;}
. ;
%%
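The behaviour of the three rules can be previewed without running Lex. The following is a rough Python imitation, not generated scanner code: it assumes the rules are an identifier pattern, a comma, and a catch-all for any other character, and it treats the newline like any other ignored character. The names SCANNER and count_tokens are illustrative.

```python
import re

IDENT = r"[a-zA-Z](?:[a-zA-Z]|[0-9])*"

# One alternative per rule, in rule order; the last one swallows anything else.
SCANNER = re.compile(
    "(?P<ident>" + IDENT + ")"
    "|(?P<comma>,)"
    r"|(?P<other>.|\n)"
)


def count_tokens(text: str):
    """Return (number of identifiers, number of commas) found in `text`."""
    nident = 0
    ncomma = 0
    for m in SCANNER.finditer(text):
        if m.group("ident") is not None:
            nident += 1          # corresponds to {++nident;}
        elif m.group("comma") is not None:
            ncomma += 1          # corresponds to {++ncomma;}
        # any other character is skipped, like the rule ". ;"
    return nident, ncomma


print(count_tokens("x1, y2, 42 z"))
```

In the input "x1, y2, 42 z" the digits of 42 are skipped one character at a time, so only x1, y2 and z are counted as identifiers.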