COMP313A Programming Languages Lexical Analysis. Lecture Outline Lexical Analysis The language of Lexical Analysis Regular Expressions.

1 COMP313A Programming Languages Lexical Analysis

2 Lecture Outline
Lexical Analysis
The language of Lexical Analysis
Regular Expressions

3 Lexical Analysis
Why split it from parsing?
– Simplifies design
  Parsers that must also handle whitespace and comments are more awkward
– Efficiency
  Use only as powerful a technique as the job needs, and nothing more
  No parsing sledgehammers for lexical nuts
– Portability
  More modular code
  More code re-use

4 Source Code Characteristics
Code
– Identifiers
  Count, max, get_num
– Language keywords
  switch, if .. then .. else, printf, return, void
– Mathematical operators
  +, *, >> …
  <=, =, != …
– Literals
  "Hello World"
Comments
Whitespace

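5 Language of Lexical Analysis
– Tokens: the categories of lexical unit passed to the parser (e.g. identifier, number, keyword)
– Patterns: the rules describing which character strings belong to a token
– Lexemes: the actual character strings in the source that match a pattern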

6 Tokens are not enough…
Clearly, if we replaced every occurrence of a variable with a token, we would lose other valuable information (for example, which variable it was)
Other data items are attributes of the tokens
– Stored in the symbol table
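To make the idea of token attributes concrete, here is a minimal sketch in Python. The names Token, SymbolTable and intern are hypothetical and do not come from the lecture; the only assumption is that an identifier token carries an index into the symbol table as its attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    kind: str                        # token class, e.g. "ID"
    lexeme: str                      # the characters matched in the source
    attribute: Optional[int] = None  # assumed here: index into the symbol table


class SymbolTable:
    """Hypothetical symbol table: one entry per distinct identifier."""

    def __init__(self):
        self._index = {}
        self.entries = []

    def intern(self, name):
        # Reuse the existing entry if this identifier was seen before.
        if name not in self._index:
            self._index[name] = len(self.entries)
            self.entries.append({"name": name})
        return self._index[name]


table = SymbolTable()
tokens = [Token("ID", name, table.intern(name)) for name in ["count", "max", "count"]]
```

All three tokens have the same kind, ID, but the attribute still tells the two variables apart, and both occurrences of count point at the same table entry.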

7 Token delimiters
When does a token/lexeme end?
e.g. xtemp=ytemp

8 Ambiguity in identifying tokens
A programming language definition will state how to resolve uncertain token assignment
<> Is it 1 or 2 tokens?
Disambiguating rules state what to do
– Reserved keywords (e.g. if) take precedence over identifiers
– 'Principle of longest substring': when more than one match is possible, take the longest
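The two disambiguating rules can be sketched with Python's re module. This is an illustration under stated assumptions, not the lecture's own scanner: the token names, the keyword set and the operator list are invented for the example. Python's alternation takes the first alternative that matches rather than the longest, so the longer operators are listed first to imitate the longest-substring rule.

```python
import re

# Hypothetical keyword set and token patterns, chosen for illustration only.
KEYWORDS = {"if", "then", "else", "return"}
TOKEN_SPEC = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"<>|<=|>=|<|>|="),   # longer operators first: "<>" before "<"
    ("SKIP", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    pos = 0
    out = []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        # Reserved keywords take precedence over identifiers.
        if kind == "ID" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        out.append((kind, lexeme))
    return out
```

With these rules, <> comes out as a single operator token, ifx is an identifier rather than the keyword if followed by x, and xtemp=ytemp splits into three tokens because = cannot extend an identifier.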

9 Regular Expressions
To represent patterns of strings of characters
REs
– Alphabet: set of legal symbols
– Meta-characters: characters with special meanings
ε is the empty string
3 basic operations
– Choice: choice1|choice2
  a|b matches either a or b
– Concatenation: firstthing secondthing
  (a|b)c matches the strings { ac, bc }
– Repetition (Kleene closure): repeatme*
  a* matches { ε, a, aa, aaa, aaaa, … }
Precedence: * is highest, | is lowest
– Thus a|bc* is a|(b(c*))
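The three basic operations and the precedence rule can be checked quickly with Python's re module, whose syntax for |, concatenation and * agrees with the notation on this slide. The helper accepts is a name introduced here; it uses re.fullmatch so that the whole string must belong to the language.

```python
import re


def accepts(pattern, string):
    """True if the whole string is in the language of the pattern."""
    return re.fullmatch(pattern, string) is not None


# Choice: a|b
choice = [s for s in ["a", "b", "c", "ab"] if accepts(r"a|b", s)]

# Concatenation: (a|b)c
concat = [s for s in ["ac", "bc", "a", "abc"] if accepts(r"(a|b)c", s)]

# Repetition: a* (the empty string stands for ε)
repeat = [s for s in ["", "a", "aaa", "ab"] if accepts(r"a*", s)]

# Precedence: a|bc* is read as a|(b(c*)), so "ac" and "bcbc" are rejected
prec = [s for s in ["a", "b", "bcc", "ac", "bcbc"] if accepts(r"a|bc*", s)]
```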

10 Regular Expressions…
We can add in regular definitions
– digit = 0|1|2|…|9
And then use them:
– digit digit*
  A sequence of 1 or more digits
One or more repetitions:
– (a|b)(a|b)* can be written (a|b)+
Any character in the alphabet: .
– .*b.* denotes strings containing at least one b
Ranges: [a-z], [a-zA-Z], [0-9] (assume character set ordering)
Not: ~a or [^a]
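The shorthand forms can be tried out the same way. In this sketch a regular definition is simply a Python string spliced into a larger pattern; the variable names are invented for the example. Python's re module supports +, ., ranges and [^a], but not the ~a form, so only [^a] is shown.

```python
import re


def accepts(pattern, string):
    """True if the whole string is in the language of the pattern."""
    return re.fullmatch(pattern, string) is not None


# Regular definition, then its use: digit digit*
digit = r"(0|1|2|3|4|5|6|7|8|9)"
number = digit + digit + "*"
number_plus = digit + "+"        # the same language written with +

samples = ["", "7", "42", "007", "4a"]
same_language = all(
    accepts(number, s) == accepts(number_plus, s) == accepts(r"[0-9]+", s)
    for s in samples
)

# .*b.* : strings containing at least one b
contains_b = [s for s in ["abc", "aaa", "b", ""] if accepts(r".*b.*", s)]

# [^a] : any single character other than a
not_a = [s for s in ["a", "b", "z", "bb"] if accepts(r"[^a]", s)]
```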

11 Some exercises
Describe the languages denoted by the following regular expressions
1. 0 ( 0 | 1 )* 0
2. ( ( ε | 0 )* )*
3. 0* 1 0* 1 0* 1 0*
Write regular definitions for the following languages
1. All strings that contain the five vowels in order
2. All strings of letters in which the letters are in ascending lexicographic order
3. All strings of 0's and 1's that do not contain the substring 011

12 Some exercises
Write a regular expression for C/C++ integers
Write a regular expression for C/C++ identifiers
Write a regular expression for C/C++ numbers

13 Limitations of REs
REs can describe many language constructs but not all
For example, with alphabet = {a,b}, describe the set of strings consisting of a single a surrounded by an equal number of b's
S = {a, bab, bbabb, bbbabbb, …}
No RE denotes S: matching the b's on the two sides requires counting without bound, which REs cannot do
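A small sketch of why this example is out of reach. The closest regular expression, b*ab*, cannot force the two runs of b's to have the same length, while a recognizer that is allowed to count can. The function name in_S is introduced here for illustration.

```python
import re

# The nearest RE: any number of b's, one a, any number of b's.
loose = re.compile(r"b*ab*")


def in_S(s):
    """Membership test for S = {a, bab, bbabb, ...}, done by counting."""
    left, sep, right = s.partition("a")
    return (
        sep == "a"                    # there is an a
        and set(left) <= {"b"}        # only b's before it
        and set(right) <= {"b"}       # only b's after it (so exactly one a)
        and len(left) == len(right)   # the counting step an RE cannot express
    )


accepted_by_re = [s for s in ["a", "bab", "bbab", "abb"] if loose.fullmatch(s)]
accepted_by_counting = [s for s in ["a", "bab", "bbab", "abb"] if in_S(s)]
```

The RE accepts all four sample strings, including bbab and abb, which are not in S; the counting version accepts only a and bab.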

14 Lookahead, <
When we read a token delimiter to establish a token, we need to make sure that it is still available
– It is the start of the next token!
This is lookahead
– Decide what to do based on the character we 'haven't read'
Sometimes implemented by reading from a buffer and then pushing the input back into the buffer
And then starting with recognizing the next token
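The push-back scheme described on this slide might look like the following in Python. This is a minimal sketch, not the lecture's implementation: PushbackReader, next_token and scan are hypothetical names, and the scanner only knows identifiers and the operators <, <= and <>.

```python
class PushbackReader:
    """Reads one character at a time and lets the scanner un-read characters."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._pushed = []

    def read(self):
        if self._pushed:
            return self._pushed.pop()
        if self._pos >= len(self._text):
            return ""                      # empty string signals end of input
        ch = self._text[self._pos]
        self._pos += 1
        return ch

    def unread(self, ch):
        if ch:                             # nothing to push back at end of input
            self._pushed.append(ch)


def next_token(reader):
    ch = reader.read()
    while ch.isspace():
        ch = reader.read()
    if ch == "":
        return None
    if ch.isalpha() or ch == "_":
        lexeme = ch
        ch = reader.read()
        while ch.isalnum() or ch == "_":
            lexeme += ch
            ch = reader.read()
        reader.unread(ch)                  # the delimiter starts the next token
        return ("ID", lexeme)
    if ch == "<":
        nxt = reader.read()                # one character of lookahead
        if nxt in ("=", ">"):
            return ("OP", "<" + nxt)
        reader.unread(nxt)
        return ("OP", "<")
    return ("CHAR", ch)


def scan(text):
    reader = PushbackReader(text)
    out = []
    while True:
        tok = next_token(reader)
        if tok is None:
            return out
        out.append(tok)
```

In xtemp=ytemp the scanner has to read = to learn that xtemp has ended, then pushes it back so that the next call sees it as the start of the next token.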

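15 Classic Fortran example
Blanks are not significant, so DO 99 I=1,10 becomes DO99I=1,10
– A DO loop: keyword DO, label 99, loop variable I
versus DO99I=1.10
– An assignment to the variable DO99I
When can the lexical analyzer assign a token?
– Not until it has read as far as the comma or the period
Push back into input buffer
– or 'backtracking'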
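A rough sketch of the decision the scanner faces here. The function classify is hypothetical and heavily simplified: it assumes upper-case input, removes blanks, and treats the statement as a DO loop only if a comma follows the =. A real Fortran scanner would also have to skip commas nested inside parentheses, which this sketch ignores.

```python
import re

# DO, a numeric label, a variable name, '=', then the rest of the statement.
DO_SHAPE = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=(.+)")


def classify(statement):
    compact = statement.replace(" ", "")   # blanks are not significant
    m = DO_SHAPE.fullmatch(compact)
    # Only the comma after '=' reveals that this is a loop header.
    if m and "," in m.group(3):
        label, var, bounds = m.groups()
        return ("DO_LOOP", label, var, bounds)
    name, _, expr = compact.partition("=")
    return ("ASSIGN", name, expr)
```

Both statements begin with the same characters DO99I=1, so the choice between the keyword DO and the identifier DO99I cannot be made until the scanner has looked well past them.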

