Lecture 2: Lexical Analysis

Lexical Analysis
INPUT: sequence of characters
OUTPUT: sequence of tokens
A lexical analyzer is generally a subroutine of the parser:
  Simpler design
  Efficient
  Portable
[Figure: Input -> Scanner -> Parser, with a shared Symbol Table. The scanner reads characters via Next_char(); the parser requests tokens via Next_token().]

Definitions
token – a set of strings defining an atomic element with a defined meaning
pattern – a rule describing a set of strings
lexeme – a sequence of characters that matches some pattern

Examples
Token          Pattern                   Sample Lexeme
while          while                     while
relation_op    = | != | < | >            <
integer        [0-9]+                    42
string         characters between " "    "hello"

Input string: size := r * 32 + c
<token,lexeme> pairs: <id, size> <assign, :=> <id, r> <arith_symbol, *> <integer, 32> <arith_symbol, +> <id, c>
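The pairing above can be reproduced with a small table-driven scanner. This is a minimal sketch in Python, not the lecture's implementation: the token names come from the slide, while the patterns, the `TOKEN_SPEC` table, and the `tokenize` function are assumptions made for illustration.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("assign", r":="),
    ("arith_symbol", r"[+\-*/]"),
    ("integer", r"[0-9]+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("ws", r"[ \t\n]+"),  # recognized but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs for text."""
    pairs = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "ws":
            pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs


pairs = tokenize("size := r * 32 + c")
for token, lexeme in pairs:
    print(f"<{token}, {lexeme}>")
```

The whitespace pattern is matched like any other token and then dropped, which is one common way to keep the main loop uniform.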

Implementing a Lexical Analyzer
Practical Issues:
  Input buffering
  Translating RE into executable form
  Must be able to capture a large number of tokens with a single machine
  Interface to parser
  Tools

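Capturing Multiple Tokens
Capturing keyword "begin":
  [Figure: state machine with transitions b, e, g, i, n, WS]
Capturing variable names:
  [Figure: state machine with transition A, then AN, then WS]
Legend: WS – white space, A – alphabetic, AN – alphanumeric
What if both need to happen at the same time?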

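Capturing Multiple Tokens
[Figure: combined state machine with transitions b, e, g, i, n, WS for the keyword and A-b, AN, AN, WS for variable names]
Legend: WS – white space, A – alphabetic, AN – alphanumeric
The machine is much more complicated – just for these two tokens!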
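To make the combined machine concrete, here is a hand-coded sketch in Python of one possible automaton that recognizes both the keyword "begin" and variable names at once. The state numbering, the `classify` function, and the decision to treat end of input like white space are assumptions for illustration; the slide's figure may be organized differently.

```python
KEYWORD = "begin"
ID = "id"  # state name for "inside an identifier"


def classify(word):
    """Run a combined keyword/identifier machine; whitespace ends the token."""
    state = 0  # states 0..5 count how many letters of "begin" have matched
    for ch in word:
        if ch in " \t\n":
            break
        if state == 0:
            if not ch.isalpha():
                return None  # a token must start with a letter
            state = 1 if ch == KEYWORD[0] else ID
        elif state != ID and state < len(KEYWORD) and ch == KEYWORD[state]:
            state += 1  # still on the keyword path
        elif ch.isalnum():
            state = ID  # left the keyword path: this is a variable name
        else:
            return None
    if state == 0:
        return None
    return "begin" if state == len(KEYWORD) else "id"


for sample in ["begin ", "beginning ", "beg ", "b2 ", "x1 "]:
    print(repr(sample), classify(sample))
```

Every keyword state needs an extra escape transition into the identifier state, which is why the machine grows quickly as more keywords are added.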

Lex – Lexical Analyzer Generator
[Figure: specification -> Flex/Lex -> lex.yy.c -> C/C++ compiler -> a.out; input -> a.out -> tokens]

Lex Specification
Definitions – Code, RE
    %{
    int charCount=0, wordCount=0, lineCount=0;
    %}
    word [^ \t\n]*
    %%
Rules – RE/Action pairs
    {word}  {wordCount++; charCount += yyleng; }
    [\n]    {charCount++; lineCount++;}
    .       {charCount++;}
    %%
User Routines
    int main() {
      yylex();
      printf("Characters %d, Words: %d, Lines: %d\n", charCount, wordCount, lineCount);
      return 0;
    }
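For readers without Lex at hand, the behavior of the three rules can be imitated in Python. This is only a sketch under stated assumptions: the `count` function and the `RULES` pattern are illustrative names, and the word pattern uses `+` instead of `*` because Python's alternation takes the first alternative that matches rather than the longest one, so an empty word match must be ruled out explicitly.

```python
import re

# One alternative per Lex rule, in specification order.
RULES = re.compile(r"(?P<word>[^ \t\n]+)|(?P<newline>\n)|(?P<other>.)")


def count(text):
    """Return (charCount, wordCount, lineCount) as the Lex actions would."""
    char_count = word_count = line_count = 0
    for m in RULES.finditer(text):
        if m.lastgroup == "word":
            word_count += 1
            char_count += len(m.group())  # plays the role of yyleng
        elif m.lastgroup == "newline":
            char_count += 1
            line_count += 1
        else:
            char_count += 1  # a blank or tab
    return char_count, word_count, line_count


print(count("hello world\nbye\n"))
```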

Lex definitions section
    %{
    int charCount=0, wordCount=0, lineCount=0;
    %}
    word [^ \t\n]*
C/C++ code:
  Surrounded by %{ … %} delimiters
  Declare any variables used in actions
RE definitions:
  Define shorthand for patterns:
    digit   [0-9]
    letter  [a-z]
    ident   {letter}({letter}|{digit})*
  Use shorthand in RE section: {ident}

Lex Regular Expressions
    {word}  {wordCount++; charCount += yyleng; }
    [\n]    {charCount++; lineCount++;}
    .       {charCount++;}
Match explicit character sequences:
  integer, "+++", \<\>
Character classes:
  [abcd]
  [a-zA-Z]
  [^0-9] – matches non-numeric

Lex Regular Expressions (cont.)
Alternation:
  twelve|12
Closure:
  * – zero or more
  + – one or more
  ? – zero or one
  {number}, {number,number} – exactly that many repetitions, or a count between the two numbers

Lex Regular Expressions (cont.)
Other operators:
  .  – matches any character except newline
  ^  – matches beginning of line
  $  – matches end of line
  /  – trailing context
  () – grouping
  {} – RE definitions
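The anchoring and trailing-context operators can be tried out in Python's `re` module, with two caveats that are assumptions of this sketch rather than part of the lecture: Python needs `re.MULTILINE` for `^` and `$` to work per line, and it has no `/` operator, so a lookahead `r(?=s)` stands in for Lex's `r/s`. The sample text and variable names are illustrative.

```python
import re

text = "count = 10\ntotal = count\n"

# ^ anchors at the start of each line (re.MULTILINE gives per-line behavior).
line_starts = re.findall(r"^[a-z]+", text, flags=re.MULTILINE)

# $ anchors at the end of each line.
line_ends = re.findall(r"[a-z0-9]+$", text, flags=re.MULTILINE)

# Stand-in for trailing context: a name is matched only when " =" follows,
# and the " =" itself is not part of the match.
assigned = re.findall(r"[a-z]+(?= =)", text)

print(line_starts)
print(line_ends)
print(assigned)
```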

Lex Matching Rules
Lex always attempts to match the longest possible string.
If two rules match strings of the same length, the first rule in the specification is used.
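The two matching rules can be sketched in a few lines of Python. The rule list, its names, and the `next_token` function below are hypothetical; the sketch tries every rule at the current position, keeps the longest match, and lets the earlier rule win a tie. It relies on each individual pattern being simple enough that Python's own matching returns its longest match.

```python
import re

# Rules in specification order; names and patterns are illustrative assumptions.
RULES = [
    ("keyword_begin", re.compile(r"begin")),
    ("id", re.compile(r"[a-z][a-z0-9]*")),
    ("integer", re.compile(r"[0-9]+")),
]


def next_token(text, pos=0):
    """Pick the rule with the longest match at pos; earlier rules win ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    if best is None:
        return None
    return best[0], best[1].group()


print(next_token("begin"))      # tie between keyword and id: first rule wins
print(next_token("beginning"))  # id matches a longer string
```

Listing keywords before the identifier rule is what makes the tie-break useful: "begin" alone is a keyword, while "beginning" is an identifier.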

Lex Operators
Precedence, highest to lowest:
  closure
  concatenation
  alternation
Special lex characters: - \ / * + > " { } . $ ( ) | % [ ] ^
Special lex characters inside [ ]: - \ [ ] ^

Examples
  a.*z
  (ab)+
  [0-9]{1,5}
  (ab|cd)?ef = abef, cdef, ef
  -?[0-9]\.[0-9]

Lex Actions
Lex actions are C (C++) code that implements some required functionality.
  The default action, for input that no rule matches, is to echo it to the output
  An empty action ignores the input
  ECHO – macro that prints out the matched string
  yytext – matched string
  yyleng – length of matched string

User Subroutines
    int main() {
      yylex();
      printf("Characters %d, Words: %d, Lines: %d\n", charCount, wordCount, lineCount);
      return 0;
    }
C/C++ code
Copied directly into the lexer code
User can supply 'main' or use default

Uses for Lex
Transforming Input – convert input from one form to another (example 1). yylex() is called once; return is not used in the specification.
Extracting Information – scan the text and return some information (example 2). yylex() is called once; return is not used in the specification.
Extracting Tokens – standard use with a compiler (example 3). Uses return to give the next token to the caller.

Regular expression
A regular expression is a kind of pattern that can be applied to text (Strings, in Java).
A regular expression either matches the text (or part of the text), or it fails to match.
Regular expressions are an extremely useful tool for manipulating text.
Regular expressions are heavily used in the automatic generation of Web pages.

Pattern matching applications:
  Scan for virus signatures
  Process natural languages
  Search for information using Google
  Search and replace in word processors
  Filter text (spam, malware)
  Validate data-entry fields (dates, URLs)

Basic Operation
Notation to specify a set of strings

Regular expression: examples
Notation is surprisingly expressive.

Regular Expression                 Yes                                  No
a* | (a*ba*ba*ba*)*                ε, bbb, aaa, abbbaaa, bbbaababbaa    b, bb, abbaaaa, baabbbaa
  (multiple of 3 b's)
a | a(a|b)*a                       a, aba, aa, abbaabba                 ε, ab, ba
  (begins and ends with a)
(a|b)*abba(a|b)*                   abba, bbabbabb, abbaabba             ε, abb, bbaaba
  (contains the substring abba)
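The claims in the table can be checked by brute force. The sketch below, written in Python as an assumption of convenience, compares each regular expression against a plain predicate for the property it is said to describe, over every string of a's and b's up to a small length. The names `CASES` and `mismatches` are illustrative, and ε is represented by the empty string.

```python
import re
from itertools import product

# Each entry pairs a regular expression from the table with a plain-Python
# predicate for the property the slide says it describes.
CASES = [
    (r"a*|(a*ba*ba*ba*)*", lambda s: s.count("b") % 3 == 0),
    (r"a|a(a|b)*a", lambda s: s.startswith("a") and s.endswith("a")),
    (r"(a|b)*abba(a|b)*", lambda s: "abba" in s),
]


def mismatches(max_len=8):
    """List (regex, string) pairs where the regex and the predicate disagree."""
    bad = []
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            s = "".join(letters)
            for regex, predicate in CASES:
                if (re.fullmatch(regex, s) is not None) != predicate(s):
                    bad.append((regex, s))
    return bad


print(mismatches())  # an empty list means every string agreed
```

`re.fullmatch` is used because the table asks whether the whole string belongs to the set, not whether some substring does.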

Using Regular expression
Built in to Java, Perl, PHP, Unix, .NET, ...
Additional operations are typically added for convenience.
Ex. [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*

Operation              Regular Expression   Yes                     No
Concatenation          hello                Othello, say hello      Hello
Any single character   ..oo..oo.            bloodroot, spoonfood    cookbook, choochoo
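The two rows above appear to use different notions of "match": the `hello` row only makes sense as a search anywhere in the text, while the `..oo..oo.` row reads as a match of the whole string. The Python sketch below makes that distinction explicit and also spot-checks the `[a-e]+` shorthand on a few strings. The helper names and sample strings are assumptions for illustration.

```python
import re


def found_in(pattern, text):
    """Substring search: the pattern may match anywhere in the text."""
    return re.search(pattern, text) is not None


def matches_all(pattern, text):
    """Whole-string match: the pattern must account for every character."""
    return re.fullmatch(pattern, text) is not None


concatenation = {t: found_in(r"hello", t) for t in ["Othello", "say hello", "Hello"]}
any_char = {
    t: matches_all(r"..oo..oo.", t)
    for t in ["bloodroot", "spoonfood", "cookbook", "choochoo"]
}

# The shorthand and its expansion should agree on every sample string.
samples = ["a", "e", "abcde", "", "f", "abf"]
shorthand_agrees = all(
    matches_all(r"[a-e]+", s) == matches_all(r"(a|b|c|d|e)(a|b|c|d|e)*", s)
    for s in samples
)

print(concatenation)
print(any_char)
print(shorthand_agrees)
```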

Using Regular expression (cont.)

Operation                       Regular Expression   Yes                    No
Replication                     a(bc)*de             ade, abcde, abcbcde    abc
One or more                     a(bc)+de             abcde, abcbcde         ade, abc
Once or not at all              a(bc)?de             ade, abcde             abc, abcbcde
Character classes               [a-m]*               blackmail, imbecile    above, below
Negation of character classes   [^aeiou]             b, c                   a, e
Exactly N times                 [^aeiou]{6}          rhythm, syzygy         rhythms, allowed
Between M and N times           [a-z]{4,6}           spider, tiger          jellyfish, cow
Whitespace characters           [a-z\s]*hello        hello, say hello       Othello, 2hello
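Each row of this table can be run as a small test. The sketch below assumes Python's `re` syntax, which happens to accept all of these patterns unchanged, and treats every row as a whole-string match; the `ROWS` list and `disagreements` function are illustrative names.

```python
import re

# (operation, pattern, strings that should match, strings that should not)
ROWS = [
    ("Replication", r"a(bc)*de", ["ade", "abcde", "abcbcde"], ["abc"]),
    ("One or more", r"a(bc)+de", ["abcde", "abcbcde"], ["ade", "abc"]),
    ("Once or not at all", r"a(bc)?de", ["ade", "abcde"], ["abc", "abcbcde"]),
    ("Character classes", r"[a-m]*", ["blackmail", "imbecile"], ["above", "below"]),
    ("Negation of character classes", r"[^aeiou]", ["b", "c"], ["a", "e"]),
    ("Exactly N times", r"[^aeiou]{6}", ["rhythm", "syzygy"], ["rhythms", "allowed"]),
    ("Between M and N times", r"[a-z]{4,6}", ["spider", "tiger"], ["jellyfish", "cow"]),
    ("Whitespace characters", r"[a-z\s]*hello", ["hello", "say hello"], ["Othello", "2hello"]),
]


def disagreements(rows):
    """Return (operation, string) pairs whose result differs from the table."""
    bad = []
    for operation, pattern, yes, no in rows:
        for s in yes:
            if re.fullmatch(pattern, s) is None:
                bad.append((operation, s))
        for s in no:
            if re.fullmatch(pattern, s) is not None:
                bad.append((operation, s))
    return bad


print(disagreements(ROWS))  # an empty list means the table holds as written
```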
