Professor Yihjia Tsai Tamkang University


1 Using JavaCC
Professor Yihjia Tsai, Tamkang University

2 Automating Lexical Analysis Overall picture
Scanner generator: RE -> NFA -> DFA -> minimized DFA -> Java scanner program
Generated scanner: string stream -> simulate DFA -> tokens

3 Building Faster Scanners from the DFA
Table-driven recognizers waste a lot of effort on every character:
- Read (and classify) the next character
- Find the next state
- Assign to the state variable
- Branch back to the top

We can do better:
- Encode state and actions in the code
- Do transition tests locally
- Generate ugly, spaghetti-like code (this is OK; the code is generated automatically)
- Takes (many) fewer operations per input character

Table-driven skeleton:
    state = s0;
    string = ε;
    char = get_next_char();
    while (char != eof) {
        state = δ(state, char);
        string = string + char;
        char = get_next_char();
    }
    if (state in Final) then report acceptance;
    else report failure;
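The contrast between a table-driven recognizer and a direct-coded one can be sketched in Java. This is only an illustration for the numeral token [0-9]+; the class name, the three-state table, and the method names are assumptions made for this sketch, not the output of any scanner generator.

```java
import java.util.Set;

class ScannerStyles {
    // Hypothetical DFA for [0-9]+ : state 0 = start, 1 = accepting, 2 = error.
    // Column 0 = digit, column 1 = any other character.
    static final int[][] DELTA = {
        {1, 2},  // from state 0
        {1, 2},  // from state 1
        {2, 2},  // from state 2
    };
    static final Set<Integer> FINAL = Set.of(1);

    // Character classification step of the table-driven loop.
    static int classify(char c) {
        return (c >= '0' && c <= '9') ? 0 : 1;
    }

    // Table-driven: classify, look up the next state, assign, loop.
    static boolean tableDriven(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = DELTA[state][classify(input.charAt(i))];
        }
        return FINAL.contains(state);
    }

    // Direct-coded: the state is implicit in the position in the code,
    // and each transition test is done locally.
    static boolean directCoded(String input) {
        int i = 0;
        int n = input.length();
        // "State 0": at least one digit is required.
        if (i >= n || input.charAt(i) < '0' || input.charAt(i) > '9') {
            return false;
        }
        i++;
        // "State 1": consume the remaining digits.
        while (i < n && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
            i++;
        }
        return i == n;
    }
}
```

Both methods accept the same strings; the direct-coded version avoids the table lookup and the explicit state variable.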

4 Inside lexical analyzer generator
How does a lexical analyzer generator work?
- Get input from the user, who defines tokens in a form equivalent to a regular grammar
- Turn the regular grammar into an NFA
- Convert the NFA into a DFA
- Generate the code that simulates the DFA
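The NFA-to-DFA step can be illustrated with the classic subset construction. The sketch below is a minimal Java version under stated assumptions: NFA states are integers, the NFA is a nested map, epsilon moves are marked with the character '\0', and the example NFA (digit+ over a one-symbol alphabet 'd') is invented for illustration. It does not reflect how JavaCC itself is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SubsetConstruction {
    static final char EPS = '\0';  // marks epsilon moves in this sketch

    // Epsilon closure: all NFA states reachable from 'states' using only epsilon moves.
    static Set<Integer> closure(Map<Integer, Map<Character, Set<Integer>>> nfa,
                                Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> result = new TreeSet<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : nfa.getOrDefault(s, Map.of()).getOrDefault(EPS, Set.of())) {
                if (result.add(t)) {
                    work.push(t);
                }
            }
        }
        return result;
    }

    // Each DFA state is a set of NFA states; the result maps DFA state -> (symbol -> DFA state).
    static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
            Map<Integer, Map<Character, Set<Integer>>> nfa, int start, Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(closure(nfa, Set.of(start)));
        while (!work.isEmpty()) {
            Set<Integer> d = work.pop();
            if (dfa.containsKey(d)) {
                continue;  // already expanded
            }
            Map<Character, Set<Integer>> row = new TreeMap<>();
            dfa.put(d, row);
            for (char c : alphabet) {
                Set<Integer> moved = new TreeSet<>();
                for (int s : d) {
                    moved.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(c, Set.of()));
                }
                if (moved.isEmpty()) {
                    continue;  // no transition on c from this DFA state
                }
                Set<Integer> target = closure(nfa, moved);
                row.put(c, target);
                work.push(target);
            }
        }
        return dfa;
    }

    // Hypothetical NFA for digit+ : 0 --d--> 1, 1 --eps--> 0, with 1 accepting.
    static Map<Integer, Map<Character, Set<Integer>>> digitsNfa() {
        Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
        nfa.put(0, Map.of('d', Set.of(1)));
        nfa.put(1, Map.of(EPS, Set.of(0)));
        return nfa;
    }
}
```

For the example NFA, `toDfa(digitsNfa(), 0, Set.of('d'))` yields two DFA states, {0} and {0, 1}. A DFA state would be accepting whenever it contains an accepting NFA state (here, state 1).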

5 Flow for Using JavaCC Extracted from

6 Structure of a JavaCC File
A JavaCC file is composed of three portions:
- Options
- Class declaration
- Specification for lexical analysis (tokens) and specification for syntax analysis

For the very first example of JavaCC, let's recognize two tokens: "+" and numerals. Use an editor to create the file and save it as numeral.jj.
(Slide callout: "Focus of this Lecture")

7 Using javaCC for lexical analysis
JavaCC is a “top-down” parser generator. Some parser generators (such as yacc, bison, and JavaCUP) need a separate lexical-analyzer generator. With JavaCC, you specify the tokens within the parser specification itself.

8 Example File
/* main class definition */
PARSER_BEGIN(Numeral)
public class Numeral {
    public static void main(String[] args) throws ParseException, TokenMgrError {
        Numeral numeral = new Numeral(System.in);
        while (numeral.getNextToken().kind != EOF);
    }
}
PARSER_END(Numeral)

/* token definitions */
TOKEN:
{
    <ADD: "+">
  | <NUMERAL: (["0"-"9"])+>
}

9 Options
The options portion is optional and is omitted in the previous example.

STATIC is a boolean option whose default value is true. If true, all methods and class variables are specified as static in the generated parser and token manager. This allows only one parser object to be present, but it improves the performance of the parser.

To perform multiple parses during one run of your Java program, you have to call the ReInit() method to reinitialize your parser if it is static. If the parser is non-static, you may use the "new" operator to construct as many parsers as you wish. These can all be used simultaneously from different threads.

10 Start
/* main class definition */
PARSER_BEGIN(Numeral)
public class Numeral {
    public static void main(String[] args) throws ParseException, TokenMgrError {
        Numeral numeral = new Numeral(System.in);
        /* simple loop getting tokens */
        while (numeral.getNextToken().kind != EOF);
    }
}
PARSER_END(Numeral)

/* token definitions */
TOKEN:
{
    <ADD: "+">
  | <NUMERAL: (["0"-"9"])+>
}

11 Compilation
After calling javacc to compile numeral.jj, seven files are generated if no error messages occur: Numeral.java, NumeralConstants.java, NumeralTokenManager.java, ParseException.java, SimpleCharStream.java, Token.java, and TokenMgrError.java.

    bash-2.05$ javacc numeral.jj
    Java Compiler Compiler Version 3.2 (Parser Generator)
    (type "javacc" with no arguments for help)
    Reading from file numeral.jj . . .
    File "TokenMgrError.java" does not exist. Will create one.
    File "ParseException.java" does not exist. Will create one.
    File "Token.java" does not exist. Will create one.
    File "SimpleCharStream.java" does not exist. Will create one.
    Parser generated successfully

12 JavaCC specification of a lexer
Slide callouts: "Note the need for ( )!" and "Defining Whitespace"

13 A Full Example See the sample file

14 Dealing with errors
Error reporting: consider the input 123e+q.
The scanner could treat it as an invalid token (a lexical error), or return the sequence of valid tokens 123, e, +, q and let the parser deal with the error.
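The second option, returning a sequence of valid tokens and leaving the complaint to the parser, can be sketched with a small hand-written scanner. The token classes NUMERAL, IDENT, and ADD and the class name are assumptions for this illustration; the slides only define ADD and NUMERAL.

```java
import java.util.ArrayList;
import java.util.List;

class ErrorTokens {
    // Splits the input into the longest valid tokens at each position.
    // Only characters that start no token at all raise a lexical error.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int start = i;
            if (Character.isDigit(c)) {
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add("NUMERAL(" + input.substring(start, i) + ")");
            } else if (Character.isLetter(c)) {
                while (i < input.length() && Character.isLetter(input.charAt(i))) {
                    i++;
                }
                tokens.add("IDENT(" + input.substring(start, i) + ")");
            } else if (c == '+') {
                i++;
                tokens.add("ADD");
            } else if (Character.isWhitespace(c)) {
                i++;  // skip whitespace
            } else {
                throw new IllegalArgumentException(
                    "Lexical error at position " + i + ": '" + c + "'");
            }
        }
        return tokens;
    }
}
```

With this scanner, `tokenize("123e+q")` returns `[NUMERAL(123), IDENT(e), ADD, IDENT(q)]`, and it is up to the parser to reject that sequence.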

15 Lexical error correction?
Sometimes interaction between the scanner and the parser can help, especially in a top-down (predictive) parse. The parser, when it calls the scanner, can pass as an argument the set of allowable tokens.

Suppose the scanner sees calss in a context where only a top-level definition is allowed. It is not too hard to guess what is meant: the scanner can guess that class was intended, generate a warning message, and return to parsing. Why should a compiler halt if it can figure out, with high probability, what was intended?

Most lexical errors are character insertions, deletions, replacements, or transpositions. The PL/C compiler from the mid-1970s did this well.
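One way to make such a guess is to check whether the unexpected word is a single insertion, deletion, replacement, or transposition away from exactly one allowed keyword. The Java sketch below illustrates that idea only; the class and method names are hypothetical, and it is not a description of how PL/C or JavaCC work.

```java
import java.util.List;
import java.util.Optional;

class KeywordRepair {
    // True if a and b differ by exactly one insertion, deletion,
    // replacement, or transposition of two adjacent characters.
    static boolean oneEditApart(String a, String b) {
        if (a.equals(b)) {
            return false;
        }
        int la = a.length();
        int lb = b.length();
        if (Math.abs(la - lb) > 1) {
            return false;
        }
        int i = 0;  // length of the common prefix
        while (i < la && i < lb && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        if (la == lb) {
            // Replacement: everything after position i matches.
            if (a.substring(i + 1).equals(b.substring(i + 1))) {
                return true;
            }
            // Transposition of positions i and i + 1.
            return i + 1 < la
                && a.charAt(i) == b.charAt(i + 1)
                && a.charAt(i + 1) == b.charAt(i)
                && a.substring(i + 2).equals(b.substring(i + 2));
        }
        // Insertion or deletion: skip one character in the longer string.
        return la > lb
            ? a.substring(i + 1).equals(b.substring(i))
            : a.substring(i).equals(b.substring(i + 1));
    }

    // Returns the single allowed keyword one edit away from 'seen',
    // or empty if there is no candidate or more than one.
    static Optional<String> guess(String seen, List<String> allowed) {
        String match = null;
        for (String keyword : allowed) {
            if (oneEditApart(seen, keyword)) {
                if (match != null) {
                    return Optional.empty();  // ambiguous, do not guess
                }
                match = keyword;
            }
        }
        return Optional.ofNullable(match);
    }
}
```

For example, `guess("calss", List.of("class", "interface", "enum"))` returns `Optional.of("class")`, because swapping the adjacent characters a and l produces the keyword. Refusing to guess when several keywords match keeps the repair conservative.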

16 Same symbol, different meaning.
How can the scanner distinguish between binary minus and unary minus? Compare x = -a; with x = 3 - a;
It can’t. It simply has to pass MINUS back to the parser and let the parser distinguish.
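A small recursive-descent sketch in Java shows how the parser can tell the two apart from grammatical position alone. The grammar, the class name, and the use of integer literals instead of variables are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

class MinusParser {
    private final List<String> tokens;
    private int pos = 0;

    MinusParser(String input) {
        tokens = scan(input);
    }

    // Scanner: every '-' becomes the same MINUS token, regardless of context.
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '-') {
                out.add("MINUS");
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                out.add(input.substring(start, i));
            } else if (Character.isWhitespace(c)) {
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return out;
    }

    // expr -> unary (MINUS unary)*    MINUS in this position is binary.
    int expr() {
        int value = unary();
        while (pos < tokens.size() && tokens.get(pos).equals("MINUS")) {
            pos++;
            value -= unary();
        }
        return value;
    }

    // unary -> MINUS unary | NUMBER   MINUS in this position is unary.
    int unary() {
        if (pos >= tokens.size()) {
            throw new IllegalStateException("Unexpected end of input");
        }
        String t = tokens.get(pos++);
        return t.equals("MINUS") ? -unary() : Integer.parseInt(t);
    }
}
```

Here `new MinusParser("3 - -2").expr()` evaluates to 5: the first MINUS is consumed in `expr` as a binary operator, the second in `unary` as negation, even though the scanner emitted the same token for both.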

17 Scanner “troublemakers”
- Unclosed strings
- Unclosed comments
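Both troublemakers can be detected with a simple scan. The Java sketch below assumes C-style /* ... */ comments and double-quoted strings that must close on the same line, with no escape sequences; those conventions and all names are assumptions for illustration.

```java
class Troublemakers {
    // Returns a diagnostic message, or null if every string and comment is closed.
    static String check(String src) {
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"') {
                int start = i++;
                // A string must end before the end of the line.
                while (i < n && src.charAt(i) != '"' && src.charAt(i) != '\n') {
                    i++;
                }
                if (i >= n || src.charAt(i) == '\n') {
                    return "unclosed string starting at offset " + start;
                }
                i++;  // skip the closing quote
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '*') {
                // A comment may span lines but must end before end of input.
                int end = src.indexOf("*/", i + 2);
                if (end < 0) {
                    return "unclosed comment starting at offset " + i;
                }
                i = end + 2;
            } else {
                i++;
            }
        }
        return null;
    }
}
```

Reporting the offset where the string or comment started is usually more helpful than reporting where the scanner gave up, since an unclosed comment is only noticed at end of input.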

18 JavaCC as a Parsing Tool

19 JavaCC Overview
- Generates a top-down parser. It could be used to generate a parser for Prolog, whose grammar is LL.
- Generates a parser in Java, so (to continue our example) it can be integrated with any Java-based Prolog compiler or interpreter.
- Token specifications and grammar specifications are in the same file, which makes debugging easier.

20 Types of Productions in JavaCC
There are four kinds of productions:
- Javacode: for something that is not context-free or is difficult to write a grammar for, e.g. recognizing matching braces and error processing.
- Regular expressions: used to describe the tokens (terminals) of the grammar.
- BNF: the standard way of specifying the productions of the grammar.
- Token manager declarations: the declarations and statements are written into the generated token manager (lexer) and are accessible from within lexical actions.

21 JavaCC Look-ahead mechanism
Look-ahead is the exploration of tokens further ahead in the input stream.
- Backtracking is unacceptable because of the performance hit.
- By default JavaCC uses one token of look-ahead; any number can be specified.

Two types of look-ahead mechanisms:
- Syntactic: a particular sequence of tokens (an expansion) is checked against the upcoming input.
- Semantic: any arbitrary Boolean expression can be specified as a look-ahead parameter.

e.g. A -> aBc and B -> b ( c )?
Valid strings: “abc” and “abcc”
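The example grammar shows why one token of look-ahead is not always enough. The following hand-written recursive-descent sketch in Java treats each character as one token and lets the choice point at ( c )? use either one or two tokens of look-ahead. It is an illustration of the concept under those assumptions, not JavaCC's generated code, and all names are hypothetical.

```java
class LookaheadDemo {
    private final String in;  // each character is one token in this sketch
    private final int k;      // look-ahead used at the choice point: 1, otherwise treated as 2
    private int pos = 0;

    LookaheadDemo(String in, int k) {
        this.in = in;
        this.k = k;
    }

    // True if the token 'offset' positions ahead is t.
    private boolean peek(int offset, char t) {
        return pos + offset < in.length() && in.charAt(pos + offset) == t;
    }

    // Consumes t if it is the next token.
    private boolean eat(char t) {
        if (peek(0, t)) {
            pos++;
            return true;
        }
        return false;
    }

    // B -> b ( c )?
    private boolean parseB() {
        if (!eat('b')) {
            return false;
        }
        // Choice point: should the optional c be taken?
        boolean take = (k == 1)
            ? peek(0, 'c')                    // one token: any c is taken greedily
            : peek(0, 'c') && peek(1, 'c');   // two tokens: only if a c remains for A
        if (take) {
            pos++;
        }
        return true;
    }

    // A -> a B c, followed by end of input
    boolean parseA() {
        return eat('a') && parseB() && eat('c') && pos == in.length();
    }

    static boolean accepts(String input, int k) {
        return new LookaheadDemo(input, k).parseA();
    }
}
```

With one token of look-ahead, `accepts("abc", 1)` is false: B greedily consumes the only c and A then fails, and without backtracking the parser cannot recover. With two tokens, `accepts("abc", 2)` and `accepts("abcc", 2)` are both true.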


