Lecture 2: Lexical Analysis

Slides:



Advertisements
Similar presentations
Lexical Analysis Consider the program: #include main() { double value = 0.95; printf("value = %f\n", value); } How is this translated into meaningful machine.
Advertisements

COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
Chapter 2 Lexical Analysis Nai-Wei Lin. Lexical Analysis Lexical analysis recognizes the vocabulary of the programming language and transforms a string.
 Lex helps to specify lexical analyzers by specifying regular expression  i/p notation for lex tool is lex language and the tool itself is refered to.
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
176 Formal Languages and Applications: We know that Pascal programming language is defined in terms of a CFG. All the other programming languages are context-free.
1 Chapter 2: Scanning 朱治平. Scanner (or Lexical Analyzer) the interface between source & compiler could be a separate pass and places its output on an.
Tools for building compilers Clara Benac Earle. Tools to help building a compiler C –Lexical Analyzer generators: Lex, flex, –Syntax Analyzer generator:
Chapter 3 Chang Chi-Chung. The Structure of the Generated Analyzer lexeme Automaton simulator Transition Table Actions Lex compiler Lex Program lexemeBeginforward.
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Lecture 2: Lexical Analysis CS 540 George Mason University.
CS 536 Spring Learning the Tools: JLex Lecture 6.
Topic #3: Lexical Analysis
1 Flex. 2 Flex A Lexical Analyzer Generator  generates a scanner procedure directly, with regular expressions and user-written procedures Steps to using.
Lexical Analysis CSE 340 – Principles of Programming Languages Fall 2015 Adam Doupé Arizona State University
Compilers: lex/3 1 Compiler Structures Objectives – –describe lex – –give many examples of lex's use , Semester 1, Lex.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
REGULAR EXPRESSIONS. Lexical Analysis Lexical analysers can be constructed by programs such as LEX These programs employ as input a description of the.
Review: Regular expression: –How do we define it? Given an alphabet, Base case: – is a regular expression that denote { }, the set that contains the empty.
CPSC 388 – Compiler Design and Construction Scanners – JLex Scanner Generator.
Lexical Analysis (I) Compiler Baojian Hua
COMP 3438 – Part II - Lecture 2: Lexical Analysis (I) Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.
Scanning & FLEX CPSC 388 Ellen Walker Hiram College.
FLEX Fast Lexical Analyzer EECS Introduction Flex is a lexical analysis (scanner) generator. Flex is provided with a user input file or Standard.
Flex: A fast Lexical Analyzer Generator CSE470: Spring 2000 Updated by Prasad.
LEX (04CS1008) A tool widely used to specify lexical analyzers for a variety of languages We refer to the tool as Lex compiler, and to its input specification.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
JLex Lecture 4 Mon, Jan 24, JLex JLex is a lexical analyzer generator in Java. It is based on the well-known lex, which is a lexical analyzer generator.
Lexical Analysis – Part I EECS 483 – Lecture 2 University of Michigan Monday, September 11, 2006.
By Neng-Fa Zhou Lexical Analysis 4 Why separate lexical and syntax analyses? –simpler design –efficiency –portability.
CPS 506 Comparative Programming Languages Syntax Specification.
Introduction to Lex Ying-Hung Jiang
1 Using Lex. 2 Introduction When you write a lex specification, you create a set of patterns which lex matches against the input. Each time one of the.
IN LINE FUNCTION AND MACRO Macro is processed at precompilation time. An Inline function is processed at compilation time. Example : let us consider this.
1 Using Lex. Flex – Lexical Analyzer Generator A language for specifying lexical analyzers Flex compilerlex.yy.clang.l C compiler -lfl a.outlex.yy.c a.outtokenssource.
Introduction to Lex Fan Wu
Introduction to Lexical Analysis and the Flex Tool. © Allan C. Milne Abertay University v
Practical 1-LEX Implementation
1 Lex & Yacc. 2 Compilation Process Lexical Analyzer Source Code Syntax Analyzer Symbol Table Intermed. Code Gen. Code Generator Machine Code.
ICS312 LEX Set 25. LEX Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the C program.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material has been reproduced and communicated to you by or on behalf of Monash University.
Lexical Analysis.
1st Phase Lexical Analysis
1 Steps to use Flex Ravi Chotrani New York University Reviewed By Prof. Mohamed Zahran.
CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis.
LECTURE 6 Scanning Part 2. FROM DFA TO SCANNER In the previous lectures, we discussed how one might specify valid tokens in a language using regular expressions.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
ICS611 Lex Set 3. Lex and Yacc Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the.
9-December-2002cse Tools © 2002 University of Washington1 Lexical and Parser Tools CSE 413, Autumn 2002 Programming Languages
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture Ahmed Ezzat.
Theory of Computation Lecture #
CS510 Compiler Lecture 2.
Lexical Analysis.
Chapter 3 Lexical Analysis.
NFAs, scanners, and flex.
Lexical Analysis CSE 340 – Principles of Programming Languages
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Regular Languages.
Review: Compiler Phases:
Compiler Structures 3. Lex Objectives , Semester 2,
Appendix B.1 Lex Appendix B.1 -- Lex.
Systems Programming & Operating Systems Unit – III
Lex Appendix B.1 -- Lex.
Presentation transcript:

Lecture 2: Lexical Analysis

Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine of parser: Simpler design Efficient Portable Next_token() Next_char() Input Scanner Parser character token Symbol Table

Definitions token – set of strings defining an atomic element with a defined meaning pattern – a rule describing a set of string lexeme – a sequence of characters that match some pattern

Examples Token Pattern Sample Lexeme while relation_op = | != | < | > < integer (0-9)* 42 string Characters between “ “ “hello”

Input string: size := r * 32 + c <token,lexeme> pairs: <id, size> <assign, :=> <id, r> <arith_symbol, *> <integer, 32> <arith_symbol, +> <id, c>

Implementing a Lexical Analyzer Practical Issues: Input buffering Translating RE into executable form Must be able to capture a large number of tokens with single machine Interface to parser Tools

Capturing Multiple Tokens Capturing keyword “begin” b e g i n WS WS – white space A – alphabetic AN – alphanumeric Capturing variable names A WS AN What if both need to happen at the same time?

Capturing Multiple Tokens b e g i n WS A-b WS – white space A – alphabetic AN – alphanumeric AN AN WS Machine is much more complicated – just for these two tokens!

Lex – Lexical Analyzer Generator specification Flex/Lex lex.yy.c C/C++ compiler a.out input tokens

Lex Specification Definitions – Code, RE Rules – RE/Action pairs %{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]* %% {word} {wordCount++; charCount += yyleng; } [\n] {charCount++; lineCount++;} . {charCount++;} main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”,charCount, wordCount, lineCount); } Definitions – Code, RE Rules – RE/Action pairs User Routines

Lex definitions section %{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]* C/C++ code: Surrounded by %{… %} delimiters Declare any variables used in actions RE definitions: Define shorthand for patterns: digit [0-9] letter [a-z] ident {letter}({letter}|{digit})* Use shorthand in RE section: {ident}

Lex Regular Expressions {word} {wordCount++; charCount += yyleng; } [\n] {charCount++; lineCount++;} . {charCount++;} Match explicit character sequences integer, “+++”, \<\> Character classes [abcd] [a-zA-Z] [^0-9] – matches non-numeric

Lex Regular Expressions(cont.) Alternation twelve | 12 Closure * - zero or more + - one or more ? – zero or one {number}, {number,number}

Lex Regular Expressions(cont.) Other operators . – matches any character except newline ^ - matches beginning of line $ - matches end of line / - trailing context () – grouping {} – RE definitions

Lex Matching Rules Lex always attempts to match the longest possible string. If two rules are matched (and match strings are same length), the first rule in the specification is used.

Lex Operators Highest: closure concatenation alternation Special lex characters: - \ / * + > “ { } . $ ( ) | % [ ] ^ Special lex characters inside [ ]: - \ [ ] ^

Examples a.*z (ab)+ [0—9]{1,5} (ab|cd)?ef = abef,cdef,ef -?[0-9]\.[0-9]

Lex Actions Lex actions are C (C++) code to implement some required functionality Default action is to echo to output Can ignore input (empty action) ECHO – macro that prints out matched string yytext – matched string yyleng – length of matched string

User Subroutines C/C++ code Copied directly into the lexer code main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”,charCount, wordCount, lineCount); } C/C++ code Copied directly into the lexer code User can supply ‘main’ or use default

Uses for Lex Transforming Input – convert input from one form to another (example 1). yylex() is called once; return is not used in specification Extracting Information – scan the text and return some information (example 2). yylex() is called once; return is not used in specification. Extracting Tokens – standard use with compiler (example 3). Uses return to give the next token to the caller.

Regular expression A regular expression is a kind of pattern that can be applied to text (Strings, in Java) A regular expression either matches the text (or part of the text), or it fails to match. Regular expressions are an extremely useful tool for manipulating text Regular expressions are heavily used in the automatic generation of Web pages

Pattern matching applications: Scan for virus signatures Process natural languages Search for information using Google Search and replace in word processors Filter text( spam, malware ) Validate data-entry field (dates, email, url)

Basic Operation Notation to specify a set of strings

Regular expression : examples Notation is surprisingly expressive. Regular Expression Yes No a* | (a*ba*ba*ba*)* multiple of 3 b's ε bbb aaa abbbaaa bbbaababbaa b bb abbaaaa baabbbaa a | a(a|b)*a begins and ends with a a aba aa abbaabba ε ab ba (a|b)*abba(a|b)* contains the substring abba abba bbabbabb abbaabba ε abb bbaaba

Using Regular expression Built in to Java, Perl , PHP, Unix, .NET,…. Additional operations typically aded for convenience. Ex. [a-e]+ is shorthand for (a|b|c|d|e) (a|b|c|d|e)* Operation Regular Expression Yes No Concatenation hello Othello say hello Hello Any single character ..oo..oo. bloodroot spoonfood cookbook choochoo

Using Regular expression Operation Regular Expression Yes No Replication a(bc)*de ade abcde abcbcde abc One or more a(bc)+de abcde abcbcde ade abc Once or not at all a(bc)?de ade abcde abc abcbcde Character classes [a-m]* blackmail imbecile above below Negation of character classes [^aeiou] b c a e Exactly N times [^aeiou]{6} rhythm syzygy rhythms allowed Between M and N times [a-z]{4,6} spider tiger jellyfish cow Whitespace characters [a-z\s]*hello hello say hello Othello 2hello