Lecture 2: Lexical Analysis CS 540 George Mason University.

Slides:



Advertisements
Similar presentations
Lexical Analysis Consider the program: #include main() { double value = 0.95; printf("value = %f\n", value); } How is this translated into meaningful machine.
Advertisements

COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Compiler construction in4020 – lecture 2 Koen Langendoen Delft University of Technology The Netherlands.
Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
 Lex helps to specify lexical analyzers by specifying regular expression  i/p notation for lex tool is lex language and the tool itself is refered to.
1 CMPSC 160 Translation of Programming Languages Fall 2002 slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module #4 Lexical.
Tools for building compilers Clara Benac Earle. Tools to help building a compiler C –Lexical Analyzer generators: Lex, flex, –Syntax Analyzer generator:
Chapter 3 Chang Chi-Chung. The Structure of the Generated Analyzer lexeme Automaton simulator Transition Table Actions Lex compiler Lex Program lexemeBeginforward.
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
Scanning with Jflex.
1 Material taught in lecture Scanner specification language: regular expressions Scanner generation using automata theory + extra book-keeping.
A brief [f]lex tutorial Saumya Debray The University of Arizona Tucson, AZ
CS 536 Spring Learning the Tools: JLex Lecture 6.
Compiler Construction Lexical Analysis Rina Zviel-Girshin and Ohad Shacham School of Computer Science Tel-Aviv University.
1 Flex. 2 Flex A Lexical Analyzer Generator  generates a scanner procedure directly, with regular expressions and user-written procedures Steps to using.
Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.
Compilers: lex/3 1 Compiler Structures Objectives – –describe lex – –give many examples of lex's use , Semester 1, Lex.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
Review: Regular expression: –How do we define it? Given an alphabet, Base case: – is a regular expression that denote { }, the set that contains the empty.
Lecture 2: Lexical Analysis
CPSC 388 – Compiler Design and Construction Scanners – JLex Scanner Generator.
COMP 3438 – Part II - Lecture 2: Lexical Analysis (I) Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 3, 09/11/2003 Prof. Roy Levow.
Scanning & FLEX CPSC 388 Ellen Walker Hiram College.
FLEX Fast Lexical Analyzer EECS Introduction Flex is a lexical analysis (scanner) generator. Flex is provided with a user input file or Standard.
Flex: A fast Lexical Analyzer Generator CSE470: Spring 2000 Updated by Prasad.
LEX (04CS1008) A tool widely used to specify lexical analyzers for a variety of languages We refer to the tool as Lex compiler, and to its input specification.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
Lexical Analysis – Part I EECS 483 – Lecture 2 University of Michigan Monday, September 11, 2006.
By Neng-Fa Zhou Lexical Analysis 4 Why separate lexical and syntax analyses? –simpler design –efficiency –portability.
CPS 506 Comparative Programming Languages Syntax Specification.
What on Earth? LEXEMETOKENPATTERN print p,r,i,n,t (leftpar( 4number4 *arith* 5number5 )rightpar) userAnswerID Letter followed by letters and digits “Game.
Introduction to Lex Ying-Hung Jiang
1 Using Lex. 2 Introduction When you write a lex specification, you create a set of patterns which lex matches against the input. Each time one of the.
IN LINE FUNCTION AND MACRO Macro is processed at precompilation time. An Inline function is processed at compilation time. Example : let us consider this.
1 Using Lex. Flex – Lexical Analyzer Generator A language for specifying lexical analyzers Flex compilerlex.yy.clang.l C compiler -lfl a.outlex.yy.c a.outtokenssource.
Introduction to Lex Fan Wu
Introduction to Lexical Analysis and the Flex Tool. © Allan C. Milne Abertay University v
Flex Fast LEX analyzer CMPS 450. Lexical analysis terms + A token is a group of characters having collective meaning. + A lexeme is an actual character.
Practical 1-LEX Implementation
1 Lex & Yacc. 2 Compilation Process Lexical Analyzer Source Code Syntax Analyzer Symbol Table Intermed. Code Gen. Code Generator Machine Code.
ICS312 LEX Set 25. LEX Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the C program.
COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material has been reproduced and communicated to you by or on behalf of Monash University.
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
1 Steps to use Flex Ravi Chotrani New York University Reviewed By Prof. Mohamed Zahran.
Scanner Generation Using SLK and Flex++ Followed by a Demo Copyright © 2015 Curt Hill.
CS412/413 Introduction to Compilers and Translators Spring ’99 Lecture 2: Lexical Analysis.
1 February 23, February 23, 2016February 23, 2016February 23, 2016 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University.
LECTURE 6 Scanning Part 2. FROM DFA TO SCANNER In the previous lectures, we discussed how one might specify valid tokens in a language using regular expressions.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
ICS611 Lex Set 3. Lex and Yacc Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the.
9-December-2002cse Tools © 2002 University of Washington1 Lexical and Parser Tools CSE 413, Autumn 2002 Programming Languages
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture Ahmed Ezzat.
Lexical Analysis.
NFAs, scanners, and flex.
Using SLK and Flex++ Followed by a Demo
Regular Languages.
TDDD55- Compilers and Interpreters Lesson 2
Lexical Analysis Why separate lexical and syntax analyses?
Review: Compiler Phases:
Compiler Structures 3. Lex Objectives , Semester 2,
Appendix B.1 Lex Appendix B.1 -- Lex.
More on flex.
Systems Programming & Operating Systems Unit – III
Lex Appendix B.1 -- Lex.
Presentation transcript:

Lecture 2: Lexical Analysis CS 540 George Mason University

CS 540 Spring 2013 GMU2 Lexical Analysis - Scanning Scanner (lexical analysis) Parser (syntax analysis) Code Optimizer Semantic Analysis (IC generator) Code Generator Symbol Table Tokens described formally Breaks input into tokens White space Source language tokens

CS 540 Spring 2013 GMU3 Lexical Analysis INPUT: sequence of characters OUTPUT: sequence of tokens A lexical analyzer is generally a subroutine of parser: Simpler design Efficient Portable Input ScannerParser Symbol Table Next_char() charactertoken Next_token()

CS 540 Spring 2013 GMU4 Definitions token – set of strings defining an atomic element with a defined meaning pattern – a rule describing a set of string lexeme – a sequence of characters that match some pattern

CS 540 Spring 2013 GMU5 Examples TokenPatternSample Lexeme while relation_op= | != | < integer(0-9)*42 stringCharacters between “ “ “hello”

CS 540 Spring 2013 GMU6 Input string: size := r * 32 + c pairs:

CS 540 Spring 2013 GMU7 Implementing a Lexical Analyzer Practical Issues: Input buffering Translating RE into executable form Must be able to capture a large number of tokens with single machine Interface to parser Tools

CS 540 Spring 2013 GMU8 Capturing Multiple Tokens Capturing keyword “begin” Capturing variable names What if both need to happen at the same time? beginWS WS – white space A – alphabetic AN – alphanumeric A AN WS

CS 540 Spring 2013 GMU9 Capturing Multiple Tokens beginWS WS – white space A – alphabetic AN – alphanumeric A-b AN WS AN Machine is much more complicated – just for these two tokens! WS

CS 540 Spring 2013 GMU10 Real lexer (handcoded) Comes from C# compiler in Rotor >950 lines of C++

CS 540 Spring 2013 GMU11 Lex – Lexical Analyzer Generator Lex compiler exec Lex specification C/C++/Java inputtokens flex – more modern version of lex jflex – java version of flex

CS 540 Spring 2013 GMU12 Lex Specification (flex) %{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]+ % {word} {wordCount++; charCount += strlen(yytext); } [\n] {charCount++; lineCount++;}. {charCount++;} % main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”, charCount, wordCount, lineCount); } Definitions – Code, RE Rules – RE/Action pairs User Routines

CS 540 Spring 2013 GMU13 Lex Specification (jflex) import java.io.*; % %class ex1 %unicode %line %column %standalone %{ static int charCount = 0, wordCount = 0, lineCount = 0; public static void main(String [] args) throws IOException { ex1 lexer = new ex1(new FileReader(args[0])); lexer.yylex(); System.out.println("Characters: " + charCount + " Words: " + wordCount +" Lines: " +lineCount); } %} %type Object //this line changes the return type of yylex into Object word = [^ \t\n]+ % {word} {wordCount++; charCount += yytext().length(); } [\n] {charCount++; lineCount++; }. {charCount++; } Definitions – Code, RE Rules – RE/Action pairs

CS 540 Spring 2013 GMU14 Lex definitions section C/C++/Java code: –Surrounded by %{… %} delimiters –Declare any variables used in actions RE definitions: –Define shorthand for patterns: digit [0-9] letter [a-z] ident {letter}({letter}|{digit})* –Use shorthand in RE section: {ident} %{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]+

CS 540 Spring 2013 GMU15 Lex Regular Expressions Match explicit character sequences –integer, “+++”, \ Character classes –[abcd] –[a-zA-Z] –[^0-9] – matches non-numeric {word}{wordCount++; charCount += strlen(yytext); } [\n]{charCount++; lineCount++;}.{charCount++;}

CS 540 Spring 2013 GMU16 Alternation –twelve | 12 Closure –* - zero or more –+ - one or more –? – zero or one –{number}, {number,number}

CS 540 Spring 2013 GMU17 Other operators –. – matches any character except newline –^ - matches beginning of line –$ - matches end of line –/ - trailing context –() – grouping –{} – RE definitions

CS 540 Spring 2013 GMU18 Lex Operators Highest: closure concatenation alternation Special lex characters: - \ / * + > “ { }. $ ( ) | % [ ] ^ Special lex characters inside [ ]: - \ [ ] ^

CS 540 Spring 2013 GMU19 Examples a.*z (ab)+ [0-9]{1,5} (ab|cd)?ef = abef,cdef,ef -?[0-9]\.[0-9]

CS 540 Spring 2013 GMU20 Lex Actions Lex actions are C (C++, Java) code to implement some required functionality Default action is to echo to output Can ignore input (empty action) ECHO – macro that prints out matched string yytext – matched string yyleng – length of matched string (not all versions have this) In Java: yytext() and yytext().length()

CS 540 Spring 2013 GMU21 User Subroutines C/C++/Java code Copied directly into the lexer code User can supply ‘main’ or use default main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”,charCount, wordCount, lineCount); }

CS 540 Spring 2013 GMU22 How Lex works Lex works by processing the file one character at a time, trying to match a string starting from that character. 1.Lex always attempts to match the longest possible string. 2.If two rules are matched (and match strings are same length), the first rule in the specification is used. Once it matches a string, it starts from the character after the string

CS 540 Spring 2013 GMU23 Lex Matching Rules 1.Lex always attempts to match the longest possible string. beg{…} begin{…} in{…} Input ‘begin’ can match either of the first two rules. The second rule will be chosen because of the length.

CS 540 Spring 2013 GMU24 Lex Matching Rules 2.If two rules are matched (the matched strings are same length), the first rule in the specification is used. begin{… } [a-z]+{…} Input ‘begin’ can match both rules – the first one will be chosen

CS 540 Spring 2013 GMU25 Lex Example: Extracting white space %{ #include %} % [ \t\n];.{ECHO;} % To compile and run above (simple.l):flex simple.l gcc lex.yy.c –ll g++ -x c++ lex.yy.c –ll a.out < inputa.out < input -lfl on some systems

CS 540 Spring 2013 GMU26 Lex Example: Extracting white space (Java) % %class ex0 %unicode %line %column %standalone % [^ \t\n] {System.out.print(yytext());}. {} [\n] {} To compile and run above (simple.l): java -jar ~cs540/JFlex.jar simple.l javac ex0.java java ex0 inputfile name of class to build

CS 540 Spring 2013 GMU27 Input: This is a file of stuff we want to extract all white space from Output: Thisisafileofstuffwewantoextractallwhitespacefrom

CS 540 Spring 2013 GMU28 Lex (C/C++) Lex always creates a file ‘lex.yy.c’ with a function yylex() -ll directs the compiler to link to the lex library (-lfl on some systems) The lex library supplies external symbols referenced by the generated code The lex library supplies a default main: main(int ac,char **av) {return yylex(); }

CS 540 Spring 2013 GMU29 Lex Example 2: Unix wc %{ int charCount=0, wordCount=0, lineCount=0; %} word [^ \t\n]+ % {word}{wordCount++; charCount += strlen(yytext); } [\n]{charCount++; lineCount++;}.{charCount++;} % main() { yylex(); printf(“Characters %d, Words: %d, Lines: %d\n”,charCount, wordCount, lineCount); }

CS 540 Spring 2013 GMU30 Lex Example 3: Extracting tokens % andreturn(AND); arrayreturn(ARRAY); beginreturn(BEGIN);. \[return(‘[‘); “:=“return(ASSIGN); [a-zA-Z][a-zA-Z0-9_]*return(ID); [+-]?[0-9]+return(NUM); [ \t\n]; %

CS 540 Spring 2013 GMU31 Uses for Lex Transforming Input – convert input from one form to another (example 1). yylex() is called once; return is not used in specification Extracting Information – scan the text and return some information (example 2). yylex() is called once; return is not used in specification. Extracting Tokens – standard use with compiler (example 3). Uses return to give the next token to the caller.

CS 540 Spring 2013 GMU32 Lex States Regular expressions are compiled to state machines. Lex allows the user to explicitly declare multiple states. %s COMMENT Default initial state INITIAL (0) Actions for matched strings may be different for different states

CS 540 Spring 2013 GMU33 Lex States %{ int ctr = 0; int linect = 1; %} %s COMMENT %.ECHO; [\n]{linect++; ECHO;} ”/*”{BEGIN COMMENT; ctr = 1;}.; [\n]linect++; ”*/”{if (ctr == 1) BEGIN INITIAL; else ctr--; } ”/*”{ctr++;} %