Lexical Analysis CSE 340 – Principles of Programming Languages

Slides:



Advertisements
Similar presentations
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Advertisements

COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
CSE 3302 Programming Languages Chengkai Li, Weimin He Spring 2008 Syntax Lecture 2 - Syntax, Spring CSE3302 Programming Languages, UT-Arlington ©Chengkai.
Prof. Fateman CS 164 Lecture 51 Lexical Analysis Lecture 5.
COS 320 Compilers David Walker. Outline Last Week –Introduction to ML Today: –Lexical Analysis –Reading: Chapter 2 of Appel.
Lexical Analysis The Scanner Scanner 1. Introduction A scanner, sometimes called a lexical analyzer A scanner : – gets a stream of characters (source.
CPSC 388 – Compiler Design and Construction
Regular Language & Expressions. Regular Language A regular language is one that a finite state machine (fsm) will accept. ‘Alphabet’: {a, b} ‘Rules’:
Language Recognizer Connecting Type 3 languages and Finite State Automata Copyright © – Curt Hill.
Lexical Analysis CSE 340 – Principles of Programming Languages Fall 2015 Adam Doupé Arizona State University
1 Syntax Specification Regular Expressions. 2 Phases of Compilation.
CSC312 Automata Theory Lecture # 2 Languages.
CSC312 Automata Theory Lecture # 2 Languages.
1 Outline Informal sketch of lexical analysis –Identifies tokens in input string Issues in lexical analysis –Lookahead –Ambiguities Specifying lexers –Regular.
Formal Methods in SE Theory of Automata Qasiar Javaid Assistant Professor Lecture # 06.
Lecture Two: Formal Languages Formal Languages, Lecture 2, slide 1 Amjad Ali.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
REGULAR EXPRESSIONS. Lexical Analysis Lexical analysers can be constructed by programs such as LEX These programs employ as input a description of the.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
1 Language Definitions Lecture # 2. Defining Languages The languages can be defined in different ways, such as Descriptive definition, Recursive definition,
1 Chapter 1 Introduction to the Theory of Computation.
Languages, Grammars, and Regular Expressions Chuck Cusack Based partly on Chapter 11 of “Discrete Mathematics and its Applications,” 5 th edition, by Kenneth.
Grammars CPSC 5135.
Lexical Analysis (I) Compiler Baojian Hua
COMP 3438 – Part II - Lecture 2: Lexical Analysis (I) Dr. Zili Shao Department of Computing The Hong Kong Polytechnic Univ. 1.
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
Module 2 How to design Computer Language Huma Ayub Software Construction Lecture 8.
L ECTURE 3 Chapter 4 Regular Expressions. I MPORTANT T ERMS Regular Expressions Regular Languages Finite Representations.
COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.
1 November 1, November 1, 2015November 1, 2015November 1, 2015 Azusa, CA Sheldon X. Liang Ph. D. Computer Science at Azusa Pacific University Azusa.
What is a language? An alphabet is a well defined set of characters. The character ∑ is typically used to represent an alphabet. A string : a finite.
Introduction to Theory of Automata By: Wasim Ahmad Khan.
1 Course Overview PART I: overview material 1Introduction 2Language processors (tombstone diagrams, bootstrapping) 3Architecture of a compiler PART II:
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
MA/CSSE 474 Theory of Computation Regular Expressions Intro.
CPS 506 Comparative Programming Languages Syntax Specification.
Syntax The Structure of a Language. Lexical Structure The structure of the tokens of a programming language The scanner takes a sequence of characters.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Recursive Definations Regular Expressions Ch # 4 by Cohen
Formal Languages and Grammars
Lecture # 4.
Mathematical Foundations of Computer Science Chapter 3: Regular Languages and Regular Grammars.
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
Scanning & Regular Expressions CPSC 388 Ellen Walker Hiram College.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
1 Strings and Languages Lecture 2-3 Ref. Handout p12-17.
Lecture 03: Theory of Automata:2014 Asif Nawaz Theory of Automata.
ICS611 Lex Set 3. Lex and Yacc Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the.
Languages and Strings Chapter 2. (1) Lexical analysis: Scan the program and break it up into variable names, numbers, etc. (2) Parsing: Create a tree.
Lexical Analyzer in Perspective
Theory of Computation Lecture #
Lecture # 2.
CS510 Compiler Lecture 2.
Language Theory Module 03.1 COP4020 – Programming Language Concepts Dr. Manuel E. Bermudez.
Lecture 2 Lexical Analysis
Chapter 3 Lexical Analysis.
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Formal Language Theory
COP4620 – Programming Language Translators Dr. Manuel E. Bermudez
REGULAR LANGUAGES AND REGULAR GRAMMARS
פרק 3 ניתוח לקסיקאלי תורת הקומפילציה איתן אביאור.
Compilers CSCI/CMPE 3334 David Egle.
Review: Compiler Phases:
Lexical Analysis Lecture 3-4 Prof. Necula CS 164 Lecture 3.
Lecture 4: Lexical Analysis & Chomsky Hierarchy
Specification of tokens using regular expressions
CSE 340 Recitation Week 3 : Sept 1st – 7th Regular Expressions
Compiler Construction
COMPILERS LECTURE(6-Aug-13)
CSC312 Automata Theory Lecture # 2 Languages.
Presentation transcript:

Lexical Analysis CSE 340 – Principles of Programming Languages Fall 2016 Adam Doupé Arizona State University http://adamdoupe.com

Language Syntax Programming Language must have a clearly specified syntax Programmers can learn the syntax and know what is allowed and what is not allowed Compiler writers can understand programs and enforce the syntax

Language Syntax Input is a series of bytes How to get from a string of characters to program execution? We must first assemble the string of characters into something that a program can understand Output is a series of tokens

Language Syntax In English, we have an alphabet a…z, ,, ., !, ?, … However, we also have a higher abstraction than letter in the alphabet Words Defined in a dictionary Categorized into Nouns Verbs Adverbs Articles Sentences Paragraphs

Language Syntax In a programming language, we also have an alphabet (the symbols that are important in the specific language) a, z, ,, ., !, ?, <, >, ;, }, {, (, ), … Just as in English, we create abstractions of the low-level alphabet Tokens == <= while if Tokens are precisely specified using patterns

Strings Alphabet symbols together make a string We define a string over an alphabet 𝛴 as a finite sequence of symbols from 𝛴 𝜺 is the empty string, an empty sequence of symbols Concatenating 𝜺 with a string s gives s 𝜺s = s 𝜺 = s In our examples, strings will be stylized differently, either "in between double quotes" italic and dark blue

Languages 𝛴 represents the set of all symbols in an alphabet We define 𝛴* as the set of all strings over 𝛴 𝛴* contains all the strings that can be created by combining the alphabet symbols into a string A language L over alphabet 𝛴 is a set of strings over 𝛴 A language L is a subset of 𝛴* Is 𝛴 infinite? Is 𝛴* infinite? Is L infinite?

Regular Expressions Tokens are typically specified using regular expressions Regular expressions are Compact Expressive Precise Widely used Easy to generate an efficient program to match a regular expression

Regular Expressions We must first define the syntax of regular expressions A regular expression is either ∅ 𝜺 a, where a is an element of the alphabet R1 | R2, where R1 and R2 are regular expressions R1 . R2, where R1 and R2 are regular expressions (R), where R is a regular expression R*, where R is a regular expression

Regular Expressions A regular expression defines a language (the set of all strings that the regular expression describes) The language L(R) of regular expression R is given by: L(∅) = ∅ L(𝜺) = {𝜺} L(a) = {a} L(R1 | R2) = L(R1) ∪ L(R2) L(R1 . R2) = L(R1) . L(R2)

L(R1 | R2) = L(R1) ∪ L(R2) Examples: L(a | b) = L(a) ∪ L(b) = {a} ∪ {b} = {a, b} L(a | b | c) = L(a | b) ∪ L(c) = {a, b} ∪ {c} = {a, b, c} L(a | 𝜺) = L(a) ∪ L(𝜺) = {a} ∪ {𝜺} = {a, 𝜺} L(𝜺 | 𝜺) = {𝜺} {𝜺} ?= {}

L(R1 . R2) = L(R1) . L(R2) Definition For two sets A and B of strings: A . B = {xy : x ∈ A and y ∈ B} Examples: A = {aa, b }, B = {a, b} A . B = {aaa, aab, ba, bb} ab ?∈A . B A = {aa, b, 𝜺}, B = {a, b} A . B = {aaa, aab, ba, bb, a, b}

Operator Precedence L( a | b . c ) What does this mean? (a | b) . c or a | (b . c) Just like in math or a programming language, we must define the operator precedence (* higher precedence than +) a + b * c (a + b) * c or a + (b * c)? . has higher precedence than | L( a | b . c) = L(a) ∪ L(b . c) = {a} ∪ {bc} = {a, bc}

Regular Expressions L( (R) ) = L(R) L( (a | b) . c ) = L (a | b) . L (c) = {a, b} . {c} = {ac, bc}

Kleene Star L(R*) = ? L (R*) = {𝜺} ∪L(R) ∪ L(R) . L(R) ∪L(R) . L(R) . L(R) ∪ L(R) . L( R) . L(R) . L(R) … Definition L0(R) = {𝜺} Li(R) = Li-1(R) . L(R) L(R*) = ∪i≥0 Li(R)

L(R*) = ∪i≥0 Li(R) Examples L(a | b*) = {a, 𝜺, b, bb, bbb, bbbb, …} L((a | b)*) = {𝜺} ∪ {a, b} ∪ {aa, ab, ba, bb} ∪ {aaa, aab, aba, abb, baa, bab, bba, bbb} ∪…

Tokens letter = a | b | c | d | e | … | A | B | C | D | E… digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ID = letter(letter | digit | _ )* a891_jksdbed 12ajkdfjb Note that we've left out the . regular expression operator. It is implied when two regular expressions are next to each other, similar to x*y=xy in math.

Tokens How to define a number? NUM = digit* 132 𝜺 NUM = digit(digit)* 0 00000000000 pdigit = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 NUM = pdigit(digit)*

Tokens NUM = pdigit . (digit)* | 0 123 0 00000000 1901adb

Tokens How to define a decimal number? DECIMAL = NUM . \. . NUM 1.5 2.10 1.01 DECIMAL = NUM . \. . digit* 1. DECIMAL = NUM . \. . digit . digit* 0.00 Note that here we mean a regular expression that matches the one-character string dot . However, to differentiate between the regular expression concatenation operator . and the character ., we escape the . with a \ (similar to strings where \n represents the newline character in a string). This means that we also need to escape \ with a \ so that the regular expression \\ matches the string containing the single character \

Lexical Analysis The job of the lexer is to turn a series of bytes (composed from the alphabet) into a sequence of tokens The API that we will discuss in this class will refer to the lexer as having a function called getToken(), which returns the next token from the input steam each time it is called Tokens are specified using regular expressions Bytes Tokens NUM, ID, NUM, OPERATOR, ID, DECIMAL, … Lexer Source

Lexical Analysis Given these tokens: ID = letter . (letter | digit | _ )* DOT = \. NUM = pdigit . (digit)* | 0 DECIMAL = NUM . DOT . digit . digit* What token does getToken() return on this string: 1.1abc1.2 NUM? DECIMAL? ID?

Longest Matching Prefix Rule Starting from the next input symbol, find the longest string that matches a token Break ties by giving preference to token listed first in the list

^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ String Matching Potential Longest Match 1.1abc1.2 All NUM DECIMAL, NUM NUM, 1 DECIMAL DECIMAL, 3 abc1.2 ID ID, 1 ID, 2 ID, 3 ID, 4 .2 DOT DOT, 1 2 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^

Mariner 1

Lexical Analysis In some programming languages, whitespace is not significant at all In most programming language, whitespace is not always significant ( 5 + 10 ) vs. (5+10) In Fortran, whitespace is ignored DO 15 I = 1,100 DO 15 I = 1.100 DO15I = 1.100 Variable assignment instead of a loop!