Prof. Fateman CS 164 Lecture 5, Slide 1: Lexical Analysis (Lecture 5)

Prof. Fateman CS 164 Lecture 5, Slide 2: Outline
Informal sketch of lexical analysis
–Identifies tokens in input string
Issues in lexical analysis
–Lookahead
–Ambiguities
Specifying lexers
–Regular expressions
–Examples of regular expressions

Prof. Fateman CS 164 Lecture 5, Slide 3: Lexical Analysis
What do we want to do? Example:
if (i == j) z = 0; else z = 1;
The input is just a string of characters:
#\tab i f ( i = = j ) #\newline #\tab z = 0 ; #\newline e l s e #\newline #\tab #\tab z = 1 ;
Goal: Partition the input string into substrings, where the substrings are tokens or lexemes.
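Written as Lisp data, the partition we are aiming for might look roughly like the list below (the kind names such as :keyword and :lparen are illustrative, not the actual CS 164 token names):

((:keyword "if") (:lparen "(") (:identifier "i") (:operator "==") (:identifier "j") (:rparen ")")
 (:identifier "z") (:operator "=") (:integer "0") (:semicolon ";")
 (:keyword "else")
 (:identifier "z") (:operator "=") (:integer "1") (:semicolon ";"))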

Prof. Fateman CS 164 Lecture 5, Slide 4: What's a Token?
A syntactic category
–In English: noun, verb, adjective, …
–In a programming language: Identifier, Integer, Single-Float, Double-Float, operator (perhaps single or multiple characters), Comment, Keyword, Whitespace, string constant, …

Prof. Fateman CS 164 Lecture 5, Slide 5: Defining Tokens More Precisely
Token categories correspond to sets of strings; some are finite sets, like keywords, but some, like identifiers, are unbounded.
Identifier: (typically) strings of letters or digits, starting with a letter. Some languages restrict the length of identifiers. Lisp does not require the first character to be a letter…
Integer: a non-empty string of digits. Bounded?
Keyword: else or if or begin or …
Whitespace: a non-empty sequence of blanks, newlines, and tabs.
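As a rough Lisp sketch (not the course's actual code), the category of a complete lexeme string could be decided like this; the keyword list here is just a sample:

(defparameter *keywords* '("if" "else" "begin"))   ; sample keywords only

(defun identifier-p (s)
  "A non-empty string of letters or digits, starting with a letter."
  (and (plusp (length s))
       (alpha-char-p (char s 0))
       (every (lambda (c) (or (alpha-char-p c) (digit-char-p c))) s)))

(defun classify-lexeme (s)
  (cond ((member s *keywords* :test #'string=) :keyword)
        ((and (plusp (length s)) (every #'digit-char-p s)) :integer)
        ((identifier-p s) :identifier)
        ((and (plusp (length s))
              (every (lambda (c) (member c '(#\Space #\Tab #\Newline))) s))
         :whitespace)
        (t :other)))

;; (classify-lexeme "else") => :KEYWORD, (classify-lexeme "x27") => :IDENTIFIER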

Prof. Fateman CS 164 Lecture 5, Slide 6: How are Tokens Created?
This is the job of the Lexer, the first pass over the source code. The Lexer classifies program substrings according to role.
We also tack onto each token the location (line and column) where it ends, so we can report errors in the source code by location. (Personally, I hate this: it assumes "batch processing" is normal and makes "interactive processing" and debugging messier.)
We read tokens until we get to end-of-file.
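A hedged sketch of such a token record in Lisp; the field names are illustrative, not the actual project structure:

(defstruct token
  kind      ; the syntactic category, e.g. :identifier, :integer, :keyword
  text      ; the lexeme itself as a string, e.g. "foo" or "42"
  line      ; line number where the lexeme ends
  column)   ; column where the lexeme ends

;; e.g. (make-token :kind :identifier :text "foo" :line 3 :column 7)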

Prof. Fateman CS 164 Lecture 5, Slide 7: How are Tokens Used?
The output of lexical analysis is a stream of tokens. In Lisp, we just make a list: (token1 token2 token3 …). [In an interactive system we would just burp out tokens as we find their ends; we don't wait for EOF.]
This stream is the input to the next pass, the parser:
–(parser (lexer “filename”)) in Lisp.
The parser makes use of token distinctions:
–An identifier FOO may be treated differently from a keyword ELSE. (Some languages have no keywords; Lisp, for example.)
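In Lisp the "read tokens until EOF and make a list" idea is a short loop. This sketch assumes some NEXT-TOKEN function (one possible version is sketched a few slides later) that returns NIL at end of file; the real CS 164 lexer may differ:

(defun lexer (filename)
  (with-open-file (stream filename)
    (loop for tok = (next-token stream)   ; NEXT-TOKEN is assumed, see Slide 10
          while tok
          collect tok)))

;; (parser (lexer "filename")) then matches the pipeline above, given a PARSER.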

Prof. Fateman CS 164 Lecture 5, Slide 8: Designing a Lexical Analyzer: Step 1
Define a finite set of tokens
–Tokens describe all items of interest
–Choice of tokens depends on language, design of parser
Recall: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Useful tokens for this expression: Integer, Keyword, operator, Identifier, Whitespace, (, ), =, ;

Prof. Fateman CS 164 Lecture 5, Slide 9: Designing a Lexical Analyzer: Step 2
Describe which strings belong to each token category. Recall:
–Identifier: strings of letters or digits, starting with a letter
–Integer: a non-empty string of digits
–Keyword: “else” or “if” or “begin” or …
–Whitespace: a non-empty sequence of blanks, newlines, and tabs

Prof. Fateman CS 164 Lecture 5, Slide 10: Lexical Analyzer: Implementation
An implementation reads characters until it finds the end of a token (the longest possible single token). It then returns a package of up to 3 items:
–What kind of thing is it? Identifier? Number? Keyword?
–If it is an identifier, which one exactly? Foo? Bar?
–Where did this token appear in the source code (line/column)?
Whitespace and comments are skipped.
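A minimal Lisp sketch of that read-until-end-of-token loop, for identifiers and integers only. A real lexer must also handle operators, strings, comments, and line/column bookkeeping, and would return a token structure rather than a bare list:

(defun next-token (stream)
  (let ((c (peek-char t stream nil :eof)))        ; T skips leading whitespace
    (cond ((eq c :eof) nil)
          ((digit-char-p c) (read-while stream #'digit-char-p :integer))
          ((alpha-char-p c) (read-while stream #'alphanumericp :identifier))
          (t (read-char stream)                   ; a single-character token
             (list :punct (string c))))))

(defun read-while (stream pred kind)
  "Consume the longest run of characters satisfying PRED (maximal munch)."
  (loop for c = (peek-char nil stream nil nil)
        while (and c (funcall pred c))
        collect (read-char stream) into chars
        finally (return (list kind (coerce chars 'string)))))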

Prof. Fateman CS 164 Lecture 5, Slide 11: Example
Recall: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Possible token-lexeme groupings:
–Identifier: i, j, z
–Keyword: if, else
–Multi-character operator: == (not in MiniJava)
–Integer: 0, 1
–(, ), =, ;: single-character operators of the same name
–Illegal: e.g. for MiniJava, ^ % $ … everything else?

Prof. Fateman CS 164 Lecture 5, Slide 12: Writing a Lexical Analyzer: Comments
The lexer usually discards "uninteresting" tokens that don't contribute to parsing, but it has to keep track of line/column counts.
If a comment can "run off the end" of the file, you must check for EOF when skipping comments too.
Sometimes comments are embedded in comments. What to do? /* 34 /* foo */ 43 */. Java handles this "wrong".
Sometimes compiler directives appear inside comments, so comments are really NOT ignored: /**set optimization = 0 */
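A sketch of skipping a possibly nested /* ... */ comment while watching for end of file. It makes the nesting choice (unlike Java), assumes the opening /* has already been consumed, and signals an error if the comment runs off the end:

(defun skip-comment (stream)
  (let ((depth 1))                                  ; one /* already seen
    (loop for c = (read-char stream nil :eof)
          do (cond ((eq c :eof)
                    (error "Comment runs off the end of the file"))
                   ((and (eql c #\*) (eql (peek-char nil stream nil nil) #\/))
                    (read-char stream) (decf depth))
                   ((and (eql c #\/) (eql (peek-char nil stream nil nil) #\*))
                    (read-char stream) (incf depth)))
          until (zerop depth))))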

Prof. Fateman CS 164 Lecture 5, Slide 13: True Crimes of Lexical Analysis
Is it as easy as it sounds? Not quite! Look at some history...

Prof. Fateman CS 164 Lecture 5, Slide 14: Lexical Analysis in FORTRAN
FORTRAN rule: whitespace is insignificant. E.g., VAR1 is the same as VA R1.
Footnote: the FORTRAN whitespace rule was motivated by the inaccuracy of punch card operators.

Prof. Fateman CS 164 Lecture 5, Slide 15: Now we see it as a terrible design!
Example. Consider:
–DO 5 I = 1,25
–DO 5 I = 1.25
The first starts a loop that runs 25 times, up to the statement labeled "5": DO 5 I = 1, 25.
The second is an assignment to the variable DO5I: DO5I = 1.25.
Reading left-to-right, we cannot tell whether DO5I is a variable or a DO statement until the "," is reached.

Prof. Fateman CS 164 Lecture 5, Slide 16: A nicer design limits the amount of lookahead needed
Even our simple example has lookahead issues: i vs. if; & vs. && (not in MJ…).
But this uses only one or two characters, and is certainly resolved by the time you find whitespace or another character, e.g. X&Y or X&&Y.
In C, consider a+++++b. Is it a++ + ++b? Or (a++)++ + b, or some other breakup?
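For example, one character of lookahead (PEEK-CHAR, which looks at the next character without consuming it) is enough to tell & from &&. A hedged sketch, with illustrative token names:

(defun read-amp-operator (stream)
  "Called when the next character is #\\&; decides between & and &&."
  (read-char stream)                               ; consume the first &
  (if (eql (peek-char nil stream nil nil) #\&)
      (progn (read-char stream) :logical-and)      ; it was &&
      :bitwise-and))                               ; it was a lone &

;; (with-input-from-string (s "&&y") (read-amp-operator s)) => :LOGICAL-AND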

Prof. Fateman CS 164 Lecture 5, Slide 17: Review
The goal of lexical analysis is to
–Partition the input string into lexemes
–Identify the token type, and perhaps the value, of each lexeme (e.g. convert_string_to_integer("12345"))
Left-to-right scan => lookahead sometimes required.
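In Common Lisp the convert_string_to_integer step mentioned above is already built in:

(parse-integer "12345")   ; => 12345 (and a second value, 5, the position reached)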

Prof. Fateman CS 164 Lecture 5, Slide 18: Next
–We need a way to describe the lexemes of each token class so that we can write a program (or build/use a program that automatically takes the description and writes the lexer!)
–Fortunately, we can start from a well-studied area in theoretical computer science: Formal Languages and Automata Theory.

Prof. Fateman CS 164 Lecture 5, Slide 19: Languages
Def. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ. (Σ is called the alphabet. Pronounced SIGMA. Why are we using Greek? So as to separate the language from the metalanguage. Presumably our alphabet does not include Σ as a character.)

Prof. Fateman CS 164 Lecture 5, Slide 20: Examples of Languages
Alphabet = English characters; Language = English sentences. Not every string of English characters is an English sentence.
Alphabet = ASCII; Language = C programs. Note: the ASCII character set is different from the English character set. UNICODE is another, much larger set, used by Java and Common Lisp.

Prof. Fateman CS 164 Lecture 5, Slide 21: One program's language can be another's alphabet
An alphabet {A, B, C, D, E, …, Z} can be used to form a word language: {CAT, DOG, EAT, HAT, IN, THE, …}. The "alphabet" {CAT, DOG, EAT, …} can be used to form a different language, e.g. {THE CAT IN THE HAT, …}.

Prof. Fateman CS 164 Lecture 5, Slide 22: Another kind of alphabet, in this case for MJ
–Numbers
–Keywords
–Identifiers [an infinite set]
–Other tokens like + - * / ( ) [ ] { }
Then the MJ language is defined over this pre-tokenized alphabet.

Prof. Fateman CS 164 Lecture 5, Slide 23: Notation
We repeat: languages are sets of strings, subsets of "the set of all strings over an alphabet Σ". We need some tools for specifying which subset we want to designate as a particular language.
For English, and an alphabet of words from the dictionary, we can use a grammar. (It almost works.) For Java programs we can also use a grammar.
But for now, we want to specify the set of tokens. That is, map the language of single characters into "sentences" that are tokens. Tokens have a simple structure.

Prof. Fateman CS 164 Lecture 5, Slide 24: Regular Languages
There are several formalisms for specifying tokens. Regular languages are the most popular:
–Simple and useful theory
–Easy to understand
–Efficient implementations are possible
–Almost powerful enough (can't do nested comments)
–Popular "almost automatic" tools to write programs
The standard notation for regular languages is regular expressions.

Prof. Fateman CS 164 Lecture 5, Slide 25: Atomic Regular Expressions
A single character 'c' denotes a set of one string: {"c"}.
The epsilon symbol ε denotes a set of one 0-length string: {""}.
The empty set ∅ = {} is not the same as ε.
–Size(∅) = 0; Size(ε) = 1.

Prof. Fateman CS 164 Lecture 5, Slide 26: Compound Regular Expressions
Union: if A and B are REs, then A + B denotes the set { s | s ∈ A or s ∈ B }.
Concatenation: AB denotes { ab | a ∈ A and b ∈ B } (concatenation of sets via concatenation of strings).
Iteration (Kleene closure): A* denotes the union of A^i for all i ≥ 0, where A^0 = {ε} and A^i = A A^(i-1).

Prof. Fateman CS 164 Lecture 5, Slide 27: In particular
(Notationally confusing concatenation with multiplication, and union with addition...)
A* = ε + A + AA + AAA + …
We also sometimes use A+ = A + AA + AAA + … = AA*.
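A small Lisp illustration of exactly this equation: enumerate the strings of A* up to a given length by building the layers A, AA, AAA, … (a sketch; it assumes every string in A is non-empty, otherwise the loop would not terminate):

(defun set-concat (xs ys)
  "All concatenations xy with x in XS and y in YS, duplicates removed."
  (remove-duplicates
   (loop for x in xs
         append (loop for y in ys collect (concatenate 'string x y)))
   :test #'string=))

(defun kleene-star-up-to (a n)
  "Strings of A* of length <= N, following A* = eps + A + AA + AAA + ..."
  (loop with result = (list "")                         ; the epsilon layer
        for layer = a then (set-concat layer a)         ; A, AA, AAA, ...
        for short = (remove-if (lambda (s) (> (length s) n)) layer)
        while short
        do (setf result (union result short :test #'string=))
        finally (return result)))

;; (kleene-star-up-to '("a" "b") 2) => ("" "a" "b" "aa" "ab" "ba" "bb") in some order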

Prof. Fateman CS 164 Lecture 5, Slide 28: Regular Expressions
Def. The regular expressions over Σ are the smallest set of expressions including:
–∅ and ε
–'c', for each character c ∈ Σ
–A + B, AB, and A*, whenever A and B are regular expressions over Σ.

Prof. Fateman CS 164 Lecture 5, Slide 29: Syntax vs. Semantics
This notation (the regular expression) means this set (the language it denotes):
–L(ε) = {""}
–L('c') = {"c"}
–L(A + B) = L(A) ∪ L(B)
–L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }
–L(A*) = the union of L(A)^i over all i ≥ 0

Prof. Fateman CS 164 Lecture 5, Slide 30: Example: Keyword
Keyword: “else” or “if” or “begin” or …
Note: 'else' denotes the same set as 'e' 'l' 's' 'e' (the concatenation of the four single-character REs).

Prof. Fateman CS 164 Lecture 5, Slide 31: Sometimes we use convenient names for sets
Keywords = { 'else' 'if' 'then' 'when' ... }

Prof. Fateman CS 164 Lecture 5, Slide 32: Example: Integers
Integer: a non-empty string of digits.
Integers are almost digit+ (where digit = '0' + '1' + … + '9').
Though: is 000 an integer?

Prof. Fateman CS 164 Lecture 5, Slide 33: Sometimes we use () in reg exps
For example, A(B+C) denotes the set AB+AC.

Prof. Fateman CS 164 Lecture 5, Slide 34: Sometimes we need a language with () in it!
For example, { "+" "(" ")" } denotes a set with 3 strings in it, using our metasymbols + ( ).
Because our programming language will usually be in the same alphabet as our description of REs, we need to take some care that our metasymbols can be distinguished from our alphabet by special marks like ". Often authors leave off these marks if they think the reader will understand. If you leave such marks off in programs, you may be in trouble.

Prof. Fateman CS 164 Lecture 5, Slide 35: Example: Identifier
Identifier: strings of letters or digits, starting with a letter
letter = 'A'+'B'+…+'Z'+'a'+…+'z'+'_'
identifier = letter(letter + digit)*
Is (letter + + digit)* the same? Is (letter* + digit*) the same? Is letter + (letter + + digit)* the same?
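For comparison, the same language letter(letter + digit)* written in the Perl-style regex syntax that practical tools use, here via the CL-PPCRE library (an illustration only, not part of the course project; it assumes Quicklisp is available to load the library):

(ql:quickload :cl-ppcre)                             ; load the regex library
(cl-ppcre:scan "^[A-Za-z_][A-Za-z0-9_]*$" "x27")     ; => 0 3 ...  (a match)
(cl-ppcre:scan "^[A-Za-z_][A-Za-z0-9_]*$" "9lives")  ; => NIL      (no match)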

Prof. Fateman CS 164 Lecture 5, Slide 36: Example: Whitespace
Whitespace: a non-empty sequence of blanks, newlines, and tabs
whitespace = (' ' + '\n' + '\t')+

Prof. Fateman CS 164 Lecture 5, Slide 37: Example: Phone Numbers
Regular expressions are all around you! Consider (510)

Prof. Fateman CS 164 Lecture 5, Slide 38: Example: Addresses
Consider

Prof. Fateman CS 164 Lecture 5, Slide 39: Summary
Regular expressions describe many useful languages. Regular languages are (exactly) those languages expressible as regular expressions.
–We still need an implementation that takes a specification of a regexp R and produces a function that answers the question: given a string s, is s in the language denoted by R? That is next time.