Copyright © 2006 The McGraw-Hill Companies, Inc. Programming Languages 2nd edition Tucker and Noonan Chapter 3 Lexical and Syntactic Analysis Syntactic.

Lexical Analysis
3.1 Chomsky Hierarchy of Languages
3.2 Purpose of Lexical Analysis
Regular Expressions
– regular expressions for Clite lexicon
Finite State Automata (FSA)
– FSA as a basis for a lexical analyzer
Lexical Analyzer (Lexer) Code

3.1 Chomsky Hierarchy
Each grammar class corresponds to a language class:
– Regular grammars: lexical grammars
– Context-free grammars: programming language syntax
– Context-sensitive grammars: able to express some type rules
– Unrestricted grammars: most powerful; can express all features of languages such as C/C++

Chomsky Hierarchy
Context-sensitive and unrestricted grammars are not appropriate for developing translators:
– Given a terminal string ω and an unrestricted grammar G, it is undecidable whether ω is in the language L(G). For a context-sensitive grammar, membership is decidable, but no efficient general algorithm is known.
– Even for a context-sensitive grammar G, it is undecidable whether L(G) contains any strings at all.
A problem is decidable if you can write an algorithm that is guaranteed to solve it in a finite number of steps.

Context-Free Grammars
Capable of expressing the concrete syntax of programming languages
Equivalent to:
– a pushdown automaton
The other grammar levels also correspond to theoretical machines (beyond the scope of this course; see CS 403 or 603).

3.2 Lexical Analysis
Input: a sequence of characters (the program)
Discard: whitespace, comments
Output: tokens
Definition: a token is a logically cohesive sequence of characters representing a single symbol, e.g.
– Identifiers: numberVal
– Literals: 123, 5.67, 'x', true
– Keywords: bool | char ...
– Operators: + - * / ...
– Punctuation: ; , ( ) { }

Character Sequences to Be Recognized by Clite Lexer (tokens + other)
Tokens          Other
Identifiers     Whitespace: space or tab
Literals        Comments: // to end-of-line
Keywords        End-of-line
Operators       End-of-file
Punctuation

Regular Expressions
Regular expressions (regexps) are patterns that describe a particular class of strings
– Used for pattern matching
– One regexp can describe or match many strings
Used in many text-processing applications
– Python, Perl, Tcl, and UNIX utilities such as grep all use regular expressions

Using Regular Expressions
An alternative to regular grammars for expressing lexical syntax
Lexical-analyzer generator programs (e.g. Lex) take regular expressions as input and produce C/C++ programs that tokenize text.

With Regular Expressions You Can
(source: http://msdn2.microsoft.com/en-us/library/101eysae(VS.80).aspx)
Test for a pattern within a string (data validation).
– For example, you can test an input string to see if a telephone number pattern or a credit card number pattern occurs within the string.
Replace text.
– Use a regular expression to identify specific text in a document and either remove it completely or replace it with other text.
Extract a substring from a string based upon a pattern match.
Find specific text within a document or input field.
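The three uses listed above (test, replace, extract) can be sketched with Java's java.util.regex package. This is only an illustration: the class name RegexUses, the phone format NNN-NNNN, and the replacement text are assumptions made for the example, and the {3} repetition count is Java regex syntax that the notation slides do not cover.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexUses {
    // Hypothetical phone format for illustration: three digits, '-', four digits.
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{4}");

    // Test: does the pattern occur anywhere in the string?
    static boolean containsPhone(String s) {
        return PHONE.matcher(s).find();
    }

    // Replace: substitute every occurrence with fixed text.
    static String redact(String s) {
        return PHONE.matcher(s).replaceAll("###-####");
    }

    // Extract: return the first matching substring, or null if there is none.
    static String firstPhone(String s) {
        Matcher m = PHONE.matcher(s);
        return m.find() ? m.group() : null;
    }
}
```

Note that find() looks for the pattern anywhere in the input, whereas matches() would require the whole input to fit the pattern.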

Regular Expression Notation
RegExpr     Meaning
x           a character x
\x          an escape character, e.g., \n or \t
{ name }    a reference to a name
M | N       M or N
M N         M followed by N
M*          zero or more occurrences of M
(On the original slide, metacharacters are shown in red.)

Regular Expression Notation (continued)
RegExpr     Meaning
M+          one or more occurrences of M
M?          zero or one occurrence of M
[aeiou]     the set of vowels; choose one
[0-9]       the set of digits; choose one ('-' is a metacharacter inside brackets)
.           any single character (one-character wildcard)
\d          same as [0-9]
\w          same as [a-zA-Z0-9_]
\s          whitespace: [ \t\n]
Some regexp implementations differ in these details.

Pattern To Match a Date
In the form yyyy-mm-dd, yyyy.mm.dd, or yyyy/mm/dd:
    (19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
(19|20)\d\d : matches "19" or "20" followed by two digits
[- /.] : matches '-' or ' ' or '/' or '.'
(0[1-9]|1[012]) : the first option matches 01 through 09, the second matches 10, 11 or 12
(0[1-9]|[12][0-9]|3[01]) : the first option matches 01 through 09, the second 10 through 29, and the third 30 or 31
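As a quick check of the date pattern above, here is a minimal Java sketch. The class and method names (DatePatternDemo, isDate) are made up for the example; the pattern itself is the one from the slide, with each backslash doubled because it sits inside a Java string literal.

```java
import java.util.regex.Pattern;

class DatePatternDemo {
    // The slide's pattern; "\\d" in Java source is the regexp \d.
    static final Pattern DATE = Pattern.compile(
        "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])");

    // True only if the whole input has the yyyy-mm-dd shape (any listed separator).
    static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDate("2006-03-15"));
        System.out.println(isDate("1999/12/31"));
        System.out.println(isDate("2006-13-01"));
    }
}
```

The pattern checks shape only: it accepts impossible dates such as 2006-02-31 and mixed separators such as 2006-03/15, so a real validator would need a further check.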

Clite Lexical Syntax: Ancillary Definitions
Category Name   Definition
anyChar         [ -~]      // all printable ASCII chars; blank through tilde
letter          [a-zA-Z]
digit           [0-9]
whitespace      [ \t]      // blank or tab
eol             \n
eof             \004

Clite Lexical Syntax (regexp metacharacters in red)
Category     Definition
keyword      bool | char | else | false | float | if | int | main | true | while
identifier   {letter}({letter} | {digit})*
integerLit   {digit}+
floatLit     {digit}+\.{digit}+
charLit      '{anyChar}'
operator     = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
separator    ; | . | { | } | ( | )
comment      // ({anyChar} | {whitespace})* {eol}
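The literal and identifier rules above translate almost directly into Java regular expressions. The sketch below expands the {letter}, {digit} and {anyChar} references by hand; the class name CliteLexemes and the classify method are assumptions for illustration, not part of Clite. Keywords are deliberately not separated here, so a keyword such as int is reported as an identifier.

```java
import java.util.regex.Pattern;

class CliteLexemes {
    // {letter}({letter} | {digit})*
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    // {digit}+
    static final Pattern INTEGER_LIT = Pattern.compile("[0-9]+");
    // {digit}+\.{digit}+
    static final Pattern FLOAT_LIT = Pattern.compile("[0-9]+\\.[0-9]+");
    // '{anyChar}' where anyChar is any printable ASCII character
    static final Pattern CHAR_LIT = Pattern.compile("'[ -~]'");

    // Name the category whose pattern matches the entire lexeme.
    static String classify(String lexeme) {
        if (IDENTIFIER.matcher(lexeme).matches()) return "identifier";
        if (FLOAT_LIT.matcher(lexeme).matches()) return "floatLit";
        if (INTEGER_LIT.matcher(lexeme).matches()) return "integerLit";
        if (CHAR_LIT.matcher(lexeme).matches()) return "charLit";
        return "unknown";
    }
}
```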

Lexical Analyzer Generators
Input: regular expressions
Output: a lexical analyzer
– C/C++: Lex, Flex
– Java: JLex
Regular grammars or regular expressions are converted to a deterministic finite state automaton (DFSA) and then to a lexical analyzer.

Elements of a Finite State Automaton
1. Set of states: represented by graph nodes
2. Input alphabet + unique end-of-input symbol
3. State transition function: represented as labelled, directed edges (arcs) connecting graph nodes
4. A unique start state
5. One or more final states

Deterministic FSA
Definition: A finite state automaton is deterministic if, for each state and each input symbol, there is at most one outgoing arc from the state labelled with the input symbol.

Use a DFSA to recognize (accept) or reject a string
Process the string, one character at a time, by making a series of moves:
– Follow the exit arc that corresponds to the leftmost input symbol, thereby consuming it.
– If there is no such arc, then either the input is accepted (if you are in a final state) or there is an error.
An input is accepted if, beginning from the start state, the automaton consumes all the input and halts in a final state.
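The acceptance procedure above can be sketched as a small table-free simulation. The automaton chosen here is an assumption for illustration: a two-state DFSA for the identifier rule {letter}({letter} | {digit})*, with state numbers and the class name IdentifierDfsa invented for the example.

```java
class IdentifierDfsa {
    // States: 0 = start, 1 = inside an identifier (the only final state).
    // A result of -1 means "no arc for this symbol", i.e. reject.
    static int step(int state, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        boolean digit = c >= '0' && c <= '9';
        if (state == 0) return letter ? 1 : -1;
        if (state == 1) return (letter || digit) ? 1 : -1;
        return -1;
    }

    // Accept only if all input is consumed and the automaton halts in the final state.
    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = step(state, input.charAt(i));
            if (state < 0) return false;   // no exit arc: error
        }
        return state == 1;
    }
}
```

Because step returns at most one next state for each (state, symbol) pair, the automaton is deterministic in the sense defined earlier.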

Practical Issues
The explicit terminator (end-of-input symbol) is used only at the end of the program, not after each token.
The symbols l and d represent an arbitrary letter and digit, respectively.
An unlabelled arc represents any valid input symbol (other than those on labelled arcs leaving the same state).

Practical Issues
When a token is recognized, move to a final state (one with no exit arc).
When a non-token (whitespace or a comment) is recognized, move back to the start state.
Recognizing EOF means the end of the source code has been reached.
The automaton must be deterministic.
Recognize keywords as identifiers; then do a table look-up.
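The last point, recognizing keywords as identifiers and then doing a table look-up, might look like the following in Java. The keyword list is the one from the Clite lexical syntax; the class name KeywordTable and the string results are assumptions for the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class KeywordTable {
    // The Clite keywords, stored once in a hash set for constant-time look-up.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "bool", "char", "else", "false", "float",
        "if", "int", "main", "true", "while"));

    // Called after the DFSA has recognized a letter-and-digit sequence.
    static String kind(String spelling) {
        return KEYWORDS.contains(spelling) ? "Keyword" : "Identifier";
    }
}
```

This keeps the automaton small: it needs one identifier path rather than a separate path of states for each keyword.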

How It's Used
The lexer is called from the parser.
Parser:
– Get next token
– Parse next token
The lexer enters the Start state each time the parser calls for a new token.
The lexer enters a "Final" state when a legal token has been recognized.
The character that causes the transition to the final state may be white space, or it may be the first character of the next token.

Lexer Code
The parser calls the lexer when it needs a new token.
The lexer must remember where it left off.
– Sometimes the lexer gets one character ahead in the input; compare ab=13; to ab = 13 ;
– In the first case, the identifier ab isn't recognized until the first character of the next token, =, is read.
– In the second case, blanks signify the ends of tokens.

Lexer Code
Solutions:
– a peek function
– a pushback function
– no symbol consumed by moving out of the start state; i.e., when the parser calls the lexer, the lexer already has the first character of the next token, probably in a variable ch
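To make the first option concrete, here is a minimal sketch of a character source with a peek function. Everything about it is an assumption for illustration (the class name CharSource, reading from a String, returning \004 at end of input); it is not the textbook's lexer. It shows that with peek, the character ending an identifier is inspected but not consumed, so ab=13; and ab = 13 ; behave the same way.

```java
class CharSource {
    static final char EOF = '\004';   // same eof character as the Clite definitions
    private final String text;
    private int pos = 0;

    CharSource(String text) { this.text = text; }

    // Look at the next character without consuming it.
    char peek() {
        return pos < text.length() ? text.charAt(pos) : EOF;
    }

    // Consume and return the next character.
    char next() {
        char c = peek();
        if (pos < text.length()) pos++;
        return c;
    }

    // Consume letters and digits; the terminating character stays unread.
    String readIdentifier() {
        StringBuilder sb = new StringBuilder();
        while (Character.isLetterOrDigit(peek())) sb.append(next());
        return sb.toString();
    }
}
```

The third option on the slide reaches the same effect differently: instead of peeking, the lexer keeps the already-read character in ch between calls.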

3.2.3 - From Design to Code
    private char ch = ' ';
    public Token next( ) {
        do {
            switch (ch) {
                ...
            }
        } while (true);
    }
Figure 3.4: Outline of Next Token Routine

Remarks
The do-while loop is exited only when a token is found.
The loop is exited via a return statement, which returns control to the parser.
The variable ch must be initialized to a space character; thereafter it always holds the next character to be processed.

Translation Rules
Pages 67 and 68 give rules for translating the DFSA into code.
A Java tokenizer method for Clite is shown on page 69 (Figure 3.5).
Auxiliary functions are described on pages 68 and 70.

    private String concat(String set) {
        StringBuffer r = new StringBuffer("");
        do {
            r.append(ch);
            ch = nextChar( );
        } while (set.indexOf(ch) >= 0);
        return r.toString( );
    }

    // auxiliary methods: isLetter, isDigit, concat
    public Token next( ) {
        do {
            if (isLetter(ch)) { // ident or keyword
                String spelling = concat(letters + digits);
                return Token.keyword(spelling);
            } else if (isDigit(ch)) { // numeric literal
                String number = concat(digits);
                if (ch != '.') // int literal
                    return Token.mkIntLiteral(number);
                number += concat(digits);
                return Token.mkFloatLiteral(number);
            }

            else switch (ch) {
            case ' ': case '\t': case '\r': case eolnCh:
                ch = nextChar( );
                break;
            // omitted '/', comments, '\'
            case eofCh:
                return Token.eofTok;
            case '+':
                ch = nextChar( );
                return Token.plusTok;
            …
            case '&':
                check('&');
                return Token.andTok;
            case '=':
                return chkOpt('=', Token.assignTok, Token.eqeqTok);

Source and Tokens
Source:
    // a first program
    // with 3 comments
    int main ( ) {
        char c;
        int i;
        c = 'h';
        i = c + 3;
    } // main
Tokens:
    Token Type     Token
    Keyword        int
    Keyword        main
    Punctuation    (
    Punctuation    )
    Punctuation    {
    Keyword        char
    Identifier     c
    Punctuation    ;
    etc.
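The design described in the preceding slides (a variable ch that always holds the next character, a concat helper, and a loop that exits only by returning a token) can be assembled into a small runnable sketch. This is a simplified stand-in, not Figure 3.5: the class name MiniLexer is invented, tokens are plain strings rather than Token objects, input comes from a String, and character literals and two-character operators are left out. Every other single character is reported under a generic "Symbol" label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MiniLexer {
    private static final char EOF_CH = '\004';
    private static final String LETTERS =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String DIGITS = "0123456789";
    private static final List<String> KEYWORDS = Arrays.asList(
        "bool", "char", "else", "false", "float", "if", "int", "main", "true", "while");

    private final String src;
    private int pos = 0;
    private char ch = ' ';   // always holds the next character to be processed

    MiniLexer(String src) { this.src = src; }

    private char nextChar() {
        return pos < src.length() ? src.charAt(pos++) : EOF_CH;
    }

    // Collect ch, then keep collecting while the following characters are in set.
    private String concat(String set) {
        StringBuilder r = new StringBuilder();
        do { r.append(ch); ch = nextChar(); } while (set.indexOf(ch) >= 0);
        return r.toString();
    }

    // Returns the next token; the loop is left only through a return statement.
    String next() {
        while (true) {
            if (LETTERS.indexOf(ch) >= 0) {            // identifier or keyword
                String s = concat(LETTERS + DIGITS);
                return (KEYWORDS.contains(s) ? "Keyword " : "Identifier ") + s;
            } else if (DIGITS.indexOf(ch) >= 0) {      // numeric literal
                String n = concat(DIGITS);
                if (ch != '.') return "IntLiteral " + n;
                n += concat(DIGITS);                   // takes '.' and the fraction digits
                return "FloatLiteral " + n;
            }
            switch (ch) {
                case ' ': case '\t': case '\r': case '\n':
                    ch = nextChar();                   // non-token: back to start
                    break;
                case EOF_CH:
                    return "Eof";
                case '/':
                    ch = nextChar();
                    if (ch != '/') return "Operator /";
                    while (ch != '\n' && ch != EOF_CH) ch = nextChar();  // skip comment
                    break;
                default: {
                    String t = "Symbol " + ch;
                    ch = nextChar();
                    return t;
                }
            }
        }
    }

    // Convenience wrapper: collect all tokens up to and including Eof.
    static List<String> tokenize(String src) {
        MiniLexer lx = new MiniLexer(src);
        List<String> out = new ArrayList<>();
        String t;
        do { t = lx.next(); out.add(t); } while (!t.equals("Eof"));
        return out;
    }
}
```

As in the slide code, a number followed by '.' is treated as a float literal even if no digits follow the point; a stricter lexer would check for at least one fraction digit.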

Parsing Algorithms – two types
Top-down (recursive descent, LL):
– begin with the most general grammar rule (start symbol)
– expand downward using more specific rules
– leaves of the parse tree should match program tokens
– equivalent to a left-most derivation
Bottom-up (LR):
– start with the leaves (tokens)
– group them together to form interior tree nodes
– end up at the root of the parse tree
– equivalent to a right-most derivation

Grammar for Parsing Example (left recursion removed for recursive descent parsing)
    Assignment → Identifier = Expression
    Expression → Term { AddOp Term }
    AddOp → + | -
    Term → Factor { MulOp Factor }
    MulOp → * | /
    Factor → [ UnaryOp ] Primary
    UnaryOp → - | !
    Primary → Identifier | Literal | ( Expression )

Recursive Descent Parsing
A recursive descent parser "builds" the parse tree in a top-down manner.
It defines a method/function for each nonterminal to recognize input derivable from that nonterminal.
Each method should:
– recognize the longest sequence of tokens (in the input stream) derivable from the nonterminal
– return an object which is the root of a subtree

Auxiliary Functions for the Parser
match( ) compares the current token to the expected token t:
– if they match, get the next token and return
– else display a syntax error message
error( ) displays the error message and exits.

    private String match(TokenType t) {
        String value = token.value();
        if (token.type().equals(t))
            token = lexer.next(); // token is a global variable
        else
            error(t);
        return value;
    }

Building the Parser - 1
General idea: begin with the start symbol.
For simplicity, assume a parser for assignments:
    Assignment → Identifier = Expression
Skeleton method:
    private Assignment assignment( ) {
        ...
        return new Assignment(...);
    }

Building the Parser - 2
    Assignment → Identifier = Expression
1. Get next token; call match: if (Token ≠ Identifier) then error, else get next token
2. If (Token ≠ '=') then error, else get next token
3. Call the method that recognizes Expression:
    Expression → Term { + | - Term }

Building the Parser – 3 (see code on page 79)
The Expression method immediately calls Term:
    Term → Factor { * | / Factor }
The Term method then calls Factor:
    Factor → [ UnaryOp ] Primary
Factor processes the UnaryOp (if present) and then calls Primary:
    Primary → Identifier | Literal | ( Expression )
If (Token ≠ Identifier) and (Token ≠ Literal) and (Token ≠ '(') then error
Else take the appropriate action based on the value of Token

Building the Parser - Summary
EBNF concrete syntax rules determine the structure of the parser.
Parser functions return objects (Term, Assignment, etc.) which are defined according to the abstract syntax rules.
The sequence of function calls plus the returned objects represents the abstract syntax tree.
Nodes in the abstract syntax tree are instances of the abstract syntax and serve as chunks of intermediate code.

Abstract Syntax Example
    Assignment = Variable target; Expression source
    Expression = Variable | Value | Binary | Unary
    Binary = Operator op; Expression term1, term2
    Unary = Operator op; Expression term
    Variable = String id
    Value = Integer value
    Operator = + | - | * | / | !

    abstract class Expression { }

    class Binary extends Expression {
        Operator op;
        Expression term1, term2;
    }

    class Unary extends Expression {
        Operator op;
        Expression term;
    }

    private Assignment assignment( ) {
        // Assignment → Identifier = Expression ;
        Variable target = new Variable(match(Token.Identifier));
        match(Token.Assign);
        Expression source = expression( );
        match(Token.Semicolon);
        return new Assignment(target, source);
    }

Building The Abstract Syntax Tree
The Assignment method returns an Assignment object which has 2 data members:
– Target: a Variable object (the identifier on the LHS)
– Source: an Expression object
The source object is obtained by calling the Expression method.

    private Expression expression( ) {
        // Expression → Term { AddOp Term }
        Expression e = term();
        while (isAddOp()) {
            Operator op = new Operator(match(token.type()));
            Expression term2 = term();
            e = new Binary(op, e, term2);
        }
        return e;
    }

Building The Abstract Syntax Tree
The Expression method returns an Expression object.
The Expression object is generated by calling Term( ) one or more times (depending on the expression).
Term( ) generates further levels in the tree.

    private Expression term( ) {
        // Term → Factor { MultiplyOp Factor }
        Expression e = factor();
        while (isMultiplyOp()) {
            Operator op = new Operator(match(token.type()));
            Expression term2 = factor();
            e = new Binary(op, e, term2);
        }
        return e;
    }
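Putting the expression, term, factor and primary methods together gives a complete, if simplified, recursive descent parser for the expression grammar. The sketch below is an illustration under several assumptions rather than the textbook's parser: the class name ExprParserSketch is invented, tokens are obtained by splitting the input on whitespace instead of calling a lexer, operators are stored as strings rather than Operator objects, literals are integers only, and trees are printed in a prefix form chosen for the example.

```java
import java.util.Arrays;
import java.util.List;

class ExprParserSketch {
    // Abstract syntax classes, following the Variable/Value/Binary/Unary rules.
    static abstract class Expression { }
    static class Variable extends Expression {
        final String id;
        Variable(String id) { this.id = id; }
        public String toString() { return id; }
    }
    static class Value extends Expression {
        final int value;
        Value(int value) { this.value = value; }
        public String toString() { return Integer.toString(value); }
    }
    static class Binary extends Expression {
        final String op; final Expression term1, term2;
        Binary(String op, Expression t1, Expression t2) { this.op = op; term1 = t1; term2 = t2; }
        public String toString() { return "(" + op + " " + term1 + " " + term2 + ")"; }
    }
    static class Unary extends Expression {
        final String op; final Expression term;
        Unary(String op, Expression term) { this.op = op; this.term = term; }
        public String toString() { return "(" + op + " " + term + ")"; }
    }

    private final List<String> tokens;   // whitespace-separated tokens (assumption)
    private int pos = 0;

    ExprParserSketch(String source) { tokens = Arrays.asList(source.trim().split("\\s+")); }

    private String token() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }

    // Check the current token against the expected one, then advance.
    private String match(String expected) {
        String value = token();
        if (!value.equals(expected))
            throw new IllegalStateException("Syntax error: expected " + expected + ", saw " + value);
        pos++;
        return value;
    }

    private boolean isAddOp() { return token().equals("+") || token().equals("-"); }
    private boolean isMultiplyOp() { return token().equals("*") || token().equals("/"); }

    // Expression -> Term { AddOp Term }
    Expression expression() {
        Expression e = term();
        while (isAddOp()) { String op = match(token()); e = new Binary(op, e, term()); }
        return e;
    }

    // Term -> Factor { MulOp Factor }
    Expression term() {
        Expression e = factor();
        while (isMultiplyOp()) { String op = match(token()); e = new Binary(op, e, factor()); }
        return e;
    }

    // Factor -> [ UnaryOp ] Primary
    Expression factor() {
        if (token().equals("-") || token().equals("!")) {
            String op = match(token());
            return new Unary(op, primary());
        }
        return primary();
    }

    // Primary -> Identifier | Literal | ( Expression )
    Expression primary() {
        String t = token();
        if (t.matches("[a-zA-Z][a-zA-Z0-9]*")) { pos++; return new Variable(t); }
        if (t.matches("[0-9]+")) { pos++; return new Value(Integer.parseInt(t)); }
        if (t.equals("(")) { match("("); Expression e = expression(); match(")"); return e; }
        throw new IllegalStateException("Syntax error at " + t);
    }

    // Parse a whole expression and return its tree in prefix form.
    static String parse(String source) {
        ExprParserSketch p = new ExprParserSketch(source);
        Expression e = p.expression();
        if (p.pos != p.tokens.size()) throw new IllegalStateException("Unexpected " + p.token());
        return e.toString();
    }
}
```

For example, ExprParserSketch.parse("a + b * 2") yields (+ a (* b 2)): the loop in term binds * more tightly than +, and rebuilding e on each iteration makes operators of equal precedence left-associative.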

Summary
If the program is syntactically correct, parsing terminates when the eof symbol is read.
The parser will have generated objects that correspond to the nodes in an abstract syntax tree.
These nodes are input to the semantic analysis phase.

Example: Parse X = 3 * Z
Abstract syntax for assignment:
    Assignment = Variable target; Expression source
    Expression = Variable | Value | Binary | Unary
    Binary = Operator op; Expression term1, term2
    Unary = Operator op; Expression term
    Variable = String id
    Value = Integer value
    Operator = + | - | * | / | !
[Figure: tree with an assignment node whose children are variable and expr]
The expression object is a Binary, generated by calling term( ) once (to get 3*z) and factor( ) and primary( ) twice each (to get 3 and z).
