CS 3304 Comparative Languages


1 CS 3304 Comparative Languages
Lecture 3: Scanning 24 January 2012

2 Introduction Syntax: the form or structure of the expressions, statements, and program units. Semantics: the meaning of the expressions, statements, and program units. Syntax and semantics provide a language’s definition. Users of a language definition: Other language designers. Implementers. Programmers (the users of the language). Basic terminology: A sentence is a string of characters over some alphabet. A language is a set of sentences. A lexeme is the lowest level syntactic unit of a language. A token is a category of lexemes (e.g., identifier).

3 Defining Languages Recognizers:
A recognition device reads input strings over the alphabet of the language and decides whether they belong to the language. Example: the syntax analysis part of a compiler (scanning). Generators: A device that generates sentences of a language. One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator.

4 BNF Fundamentals In BNF, abstractions are used to represent classes of syntactic structures: they act like syntactic variables (also called nonterminal symbols, or just nonterminals). Terminals are lexemes or tokens. A rule has a left-hand side (LHS), which is a nonterminal, and a right-hand side (RHS), which is a string of terminals and/or nonterminals. Nonterminals are often italic or enclosed in angle brackets. Examples of BNF rules:
<ident_list> → identifier | identifier, <ident_list>
<if_stmt> → if <logic_expr> then <stmt>
Grammar: a finite non-empty set of rules. A start symbol is a special element of the nonterminals of a grammar.
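The <ident_list> rule above can be read both as a recognizer and as a generator. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes the input has already been reduced to a list of token names ("identifier" and ",").

```python
def is_ident_list(tokens):
    """Recognizer for: <ident_list> -> identifier | identifier , <ident_list>

    Assumes tokens is a list of token names, e.g. ["identifier", ",", "identifier"].
    """
    if not tokens or tokens[0] != "identifier":
        return False
    if len(tokens) == 1:
        return True  # first alternative: a single identifier
    # second alternative: identifier , <ident_list>
    return tokens[1] == "," and is_ident_list(tokens[2:])


def generate_ident_list(n):
    """Generator: derive the sentence with n identifiers (n >= 1 is assumed)."""
    if n == 1:
        return ["identifier"]
    return ["identifier", ","] + generate_ident_list(n - 1)
```

Every sentence produced by generate_ident_list is accepted by is_ident_list, while a list such as ["identifier", ","] is rejected because the recursive alternative requires another <ident_list> after the comma.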

5 Scanner A scanner is responsible for: Tokenizing source.
Removing comments. (Often) dealing with pragmas (i.e., significant comments). Saving text of identifiers, numbers, strings. Saving source locations (file, line, column) for error messages.

6 Scanning Example I Suppose we are building an ad-hoc (hand-written) scanner for Pascal: We read the characters one at a time with look-ahead. If it is one of the one-character tokens { ( ) [ ] < > , ; = + - etc. }, we announce that token. If it is a ., we look at the next character: If that is also a dot, we announce the two-character token .. Otherwise, we announce . and reuse the look-ahead.

7 Scanning Example II If it is a <, we look at the next character:
If that is a =, we announce <=. Otherwise, we announce < and reuse the look-ahead, etc. If it is a letter, we keep reading letters and digits and maybe underscores until we can't anymore: Then we check to see if it is a reserved word. If it is a digit, we keep reading until we find a non-digit: If that is not a ., we announce an integer. Otherwise, we keep looking for a real number. If the character after the . is not a digit, we announce an integer and reuse the . and the look-ahead.
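The steps on these two slides can be sketched as a small hand-written scanner. This is a minimal Python illustration, not the lecture's code: the token kind names, the RESERVED subset, and the ONE_CHAR set are assumptions chosen for the example, and comments and strings are not handled.

```python
RESERVED = {"begin", "end", "if", "then", "else", "while", "do"}  # illustrative subset
ONE_CHAR = set("()[],;=+-*>")  # assumed set of one-character tokens


def scan(src):
    """Return a list of (kind, lexeme) pairs for a Pascal-like fragment."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
        elif c == ".":
            # One character of look-ahead separates "." from "..".
            if i + 1 < n and src[i + 1] == ".":
                tokens.append(("dotdot", ".."))
                i += 2
            else:
                tokens.append(("dot", "."))
                i += 1
        elif c == "<":
            if i + 1 < n and src[i + 1] == "=":
                tokens.append(("op", "<="))
                i += 2
            else:
                tokens.append(("op", "<"))
                i += 1
        elif c.isalpha():
            j = i
            while j < n and (src[j].isalnum() or src[j] == "_"):
                j += 1
            word = src[i:j]
            kind = "reserved" if word.lower() in RESERVED else "identifier"
            tokens.append((kind, word))
            i = j
        elif c.isdigit():
            j = i
            while j < n and src[j].isdigit():
                j += 1
            # A "." continues the number only if a digit follows it;
            # otherwise the "." is left in the input for the next token.
            if j + 1 < n and src[j] == "." and src[j + 1].isdigit():
                j += 1
                while j < n and src[j].isdigit():
                    j += 1
                tokens.append(("real", src[i:j]))
            else:
                tokens.append(("integer", src[i:j]))
            i = j
        elif c in ONE_CHAR:
            tokens.append(("op", c))
            i += 1
        else:
            raise ValueError(f"unexpected character {c!r} at offset {i}")
    return tokens
```

For example, scan("a[3..5]") yields an identifier, a bracket, the integer 3, the .. token, the integer 5, and the closing bracket, because the digit branch refuses to consume a dot that is not followed by a digit.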

8 Deterministic Finite Automaton
Pictorial representation of a scanner for calculator tokens, in the form of a finite automaton. This is a deterministic finite automaton (DFA): Lex, scangen, ANTLR, etc. build these things automatically from a set of regular expressions. Specifically, they construct a machine that accepts the language generated by those regular expressions.

9 The Longest Possible Token Rule
We scan over and over to get one token after another. Nearly universal rule: always take the longest possible token from the input, thus: foobar is foobar and never f or foo or foob. The rule means you return only when the next character can't be used to continue the current token: The next character will generally be saved for the next token. In some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed: In Pascal, for example, when you have a 3 and you see a dot: Do you proceed (in hopes of getting 3.14)? Or do you stop (in fear of getting 3..5)? Regular expressions "generate" a regular language. DFAs "recognize" a regular language.
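One way to see the longest-possible-token rule in action is to try every token pattern at the current position and keep the longest match. The Python sketch below uses the standard re module; the pattern set and names are assumptions made for this illustration, and real scanner generators compile the patterns into a single DFA rather than trying them one by one.

```python
import re

# Hypothetical token patterns for the example; earlier entries win ties.
TOKEN_PATTERNS = [
    ("real", re.compile(r"[0-9]+\.[0-9]+")),
    ("integer", re.compile(r"[0-9]+")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("dotdot", re.compile(r"\.\.")),
    ("dot", re.compile(r"\.")),
]


def longest_match(text, pos):
    """Return (kind, end) for the longest pattern match at pos, or None."""
    best = None
    for kind, pattern in TOKEN_PATTERNS:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (kind, m.end())
    return best


def tokenize(text):
    """Split text into (kind, lexeme) pairs using the longest-match rule."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = longest_match(text, pos)
        if best is None:
            raise ValueError(f"no token matches at offset {pos}")
        kind, end = best
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens
```

With these patterns, foobar comes out as a single identifier, 3.14 as one real, and 3..5 as the integer 3, the .. token, and the integer 5, since the real pattern requires a digit immediately after the dot.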

10 Building Scanners Scanners tend to be built three ways:
Ad-hoc. Semi-mechanical pure DFA (usually as nested case statements). Table-driven DFA. Ad-hoc generally yields the fastest, most compact code by doing lots of special-purpose things, though good automatically-generated scanners come very close. Writing a pure DFA as a set of nested case statements is a surprisingly useful programming technique (Figure 12.1), though it is often easier to use perl, awk, sed or similar tools. Table-driven DFA is what lex and scangen produce: lex (flex): C code. scangen: numeric tables and a separate driver (Figure 2.12). ANTLR: Java code.
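To make the table-driven approach concrete, here is a small Python sketch in which the DFA lives entirely in data and a generic driver walks it. The state names, the table, and the next_token function are assumptions for this illustration (they are not the output of lex or scangen), and the automaton only recognizes integers and reals.

```python
def char_class(c):
    """Map a character to the column used in the transition table."""
    if "0" <= c <= "9":
        return "digit"
    if c == ".":
        return "dot"
    return "other"


# Transition table: (state, character class) -> next state.
TRANSITIONS = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "int_dot",
    ("int_dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}

# Accepting states and the token kind each one announces.
ACCEPTING = {"int": "integer", "frac": "real"}


def next_token(text, pos=0):
    """Run the DFA from pos; return (kind, lexeme, end) or None.

    The driver remembers the last accepting state it passed through, so it
    can back up when the input stops matching (e.g. after "3." in "3..5").
    """
    state = "start"
    last_accept = None
    i = pos
    while i < len(text):
        nxt = TRANSITIONS.get((state, char_class(text[i])))
        if nxt is None:
            break
        state = nxt
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    if last_accept is None:
        return None
    kind, end = last_accept
    return kind, text[pos:end], end
```

Changing the language recognized only requires editing TRANSITIONS and ACCEPTING; the driver stays the same, which is the separation between numeric tables and a driver that the slide describes.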

11 Summary BNF and context-free grammars are equivalent meta-languages that are well-suited for describing the syntax of programming languages. Syntax analysis is a common part of language implementation. Scanners (lexical analyzers) use pattern matching to isolate small-scale parts of a program. ANTLR provides support for scanners (lexers), parsers, and tree-parsers.

