Introduction to JavaCC


1 Introduction to JavaCC
Cheng-Chia Chen

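2 What is a parser generator
[Figure: a character stream (T o t a l = p r i c e + t a x ;) is fed to the Scanner, which emits the tokens Total = price + tax ;. The Parser turns these tokens into a parse tree: an assignment with Total on the left and an Expr over the ids price and tax on the right. The parser generator (JavaCC) produces the Scanner and the Parser from a lexical + grammar specification.]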
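As a rough illustration of the scanner stage described above, the following is a minimal hand-written Java sketch that splits a character stream into tokens. It is not what JavaCC generates; the class name TinyScanner and its token rules (identifiers plus single-character symbols) are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; JavaCC would generate a token manager instead.
public class TinyScanner {
    // Splits the input into identifier tokens and single-character symbol tokens,
    // discarding any whitespace between them.
    public static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // skip blanks
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;                               // extend the identifier
                }
                tokens.add(input.substring(start, i));
            } else {
                tokens.add(String.valueOf(c));         // '=', '+', ';', ...
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("Total=price+tax;"));  // prints [Total, =, price, +, tax, ;]
    }
}
```

A parser would then consume this token list rather than raw characters, which is the division of labour the figure describes.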

3 JavaCC
JavaCC (Java Compiler Compiler) is a scanner and parser generator.
  It produces a scanner and/or a parser written in Java, and is itself written in Java.
There are many parser generators:
  yacc (Yet Another Compiler-Compiler) for the C programming language (dragon book, chapter 4.9)
  Bison from gnu.org
There are also many parser generators written in Java:
  JavaCUP; ANTLR; SableCC

4 More on classification of Java parser generators
Bottom-up parser generator tools
  JavaCUP
  jay, YACC for Java
  SableCC, the Sable Compiler Compiler
Top-down parser generator tools
  ANTLR, ANother Tool for Language Recognition
  JavaCC, Java Compiler Compiler (javacc.dev.java.net)

5 Features of JavaCC
Top-down LL(k) parser generator
Lexical and grammar specifications in one file
Tree-building preprocessor: JJTree
Extremely customizable: many different options are selectable
Document generation: JJDoc
Internationalized: can handle full Unicode
Syntactic and semantic lookahead

6 Features of JavaCC (cont'd)
Permits extended BNF specifications: | * ? + () can be used on the right-hand side
Lexical states and lexical actions
Case-sensitive/insensitive lexical analysis
Extensive debugging capability
Special tokens
Very good error reporting

7 JavaCC Installation
Download the file javacc-3.X.zip.
Unzip javacc-3.X.zip to a directory %JCC_HOME%.
Add the %JCC_HOME%\bin directory to your %path%.
  javacc, jjtree and jjdoc can now be invoked directly from the command line.

8 Steps to use JavaCC
Write a JavaCC specification (.jj file)
  Defines the grammar and actions in a file (say, calc.jj)
Run JavaCC to generate a scanner and a parser
  javacc calc.jj
  Generates the parser, scanner, token, … Java sources
Write your program that uses the parser
  For example, UseParser.java
Compile and run your program
  javac -classpath . *.java
  java -cp . mainpackage.MainClass

9 Example 1: parse a spec of regular expressions and match it with input strings
Grammar: re.jj
Example input:
  % all strings ending in "ab"
  (a|b)*ab;
  aba;
  ababb;
Our task: for each input string (lines 3, 4), determine whether it matches the regular expression (line 2).

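10 The overall picture
[Figure: javaCC generates REParserTokenManager and REParser from re.jj. The token manager turns the input (% comment / (a|b)*ab; / a; / ab;) into tokens for REParser, and MainClass obtains the result.]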

11 Class diagram (to be added)

12 Format of a JavaCC input Grammar
javacc_options
PARSER_BEGIN ( <IDENTIFIER>1 )
java_compilation_unit
PARSER_END ( <IDENTIFIER>2 )
( production )*

14 The input spec file (re.jj)
options {
  USER_TOKEN_MANAGER=false;
  BUILD_TOKEN_MANAGER=true;
  OUTPUT_DIRECTORY="./reparser";
  STATIC=false;
}

16 re.jj
PARSER_BEGIN(REParser)
package reparser;
import java.lang.*;
…
import dfa.*;

public class REParser {
  public FA tg = new FA();

  // output error message with current line number
  public static void msg(String s) {
    System.out.println("ERROR" + s);
  }

  public static void main(String args[]) throws Exception {
    REParser reparser = new REParser(System.in);
    reparser.S();
  }
}
PARSER_END(REParser)

17 re.jj (Token definition)
TOKEN : {
    <SYMBOL: ["0"-"9","a"-"z","A"-"Z"] >
  | <EPSILON: "epsilon" >
  | <LPAREN: "(" >
  | <RPAREN: ")" >
  | <OR: "|" >
  | <STAR: "*" >
  | <SEMI: ";" >
}
SKIP : {
    < ( [" ","\t","\n","\r","\f"] )+ >
  | < "%" ( ~["\n"] )* "\n" > { System.out.println(image); }
}

18 re.jj (productions)
void S() : { FA d1; }
{
  d1 = R() <SEMI>
  {
    tg = d1;
    System.out.println("------NFA");      tg.print();
    System.out.println("------DFA");      tg = tg.NFAtoDFA(); tg.print();
    System.out.println("------Minimize"); tg = tg.minimize(); tg.print();
    System.out.println("------Renumber"); tg = tg.renumber(); tg.print();
    System.out.println("------Execute");
  }
  testCases()
}

19 re.jj
void testCases() : {}
{ ( testCase() )+ }

void testCase() : { String testInput; }
{
  testInput = symbols() <SEMI>
  { tg.execute(testInput); }
}

String symbols() :
{ Token token = null; StringBuffer result = new StringBuffer(); }
{
  ( token = <SYMBOL> { result.append(token.image); } )*
  { return result.toString(); }
}

20 re.jj (regular expression)
// R --> RUnit | RConcat | RChoice
FA R() : { FA result; }
{
  result = RChoice() { return result; }
}

FA RUnit() : { FA result; Token d1; }
{
  (   <LPAREN> result = RChoice() <RPAREN>
    | <EPSILON> { result = tg.epsilon(); }
    | d1 = <SYMBOL> { result = tg.symbol(d1.image); }
  )
  { return result; }
}

21 re.jj
FA RChoice() : { FA result, temp; }
{
  result = RConcat()
  ( <OR> temp = RConcat() { result = result.choice(temp); } )*
  { return result; }
}

FA RConcat() : { FA result, temp; }
{
  result = RStar()
  ( temp = RStar() { result = result.concat(temp); } )*
  { return result; }
}

FA RStar() : { FA result; }
{
  result = RUnit()
  ( <STAR> { result = result.closure(); } )*
  { return result; }
}
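The layering of RChoice, RConcat, RStar and RUnit is what gives | the lowest precedence and * the highest. The sketch below is a hand-written recursive-descent version of the same structure in plain Java. It is an assumption-laden stand-in, not generated code: the class name ReSketch is hypothetical, it returns a prefix-notation string instead of an FA (the dfa.FA class is not shown in the slides), it reads characters directly instead of tokens, and it omits the EPSILON alternative and whitespace handling.

```java
// Hypothetical stand-in for the generated REParser: builds a prefix-notation
// string instead of an FA, because the dfa.FA class is not shown in the slides.
public class ReSketch {
    private final String in;
    private int pos = 0;

    private ReSketch(String in) { this.in = in; }

    public static String parse(String regex) {
        ReSketch p = new ReSketch(regex);
        String result = p.choice();
        if (p.pos != p.in.length()) {
            throw new IllegalArgumentException("unexpected '" + p.in.charAt(p.pos) + "' at " + p.pos);
        }
        return result;
    }

    // Returns the next character without consuming it, or '\0' at end of input.
    private char peek() { return pos < in.length() ? in.charAt(pos) : '\0'; }

    // RChoice: RConcat ( "|" RConcat )*
    private String choice() {
        String result = concat();
        while (peek() == '|') {
            pos++;
            result = "choice(" + result + "," + concat() + ")";
        }
        return result;
    }

    // RConcat: RStar ( RStar )*  (continue while the next char can start a unit)
    private String concat() {
        String result = star();
        while (peek() == '(' || Character.isLetterOrDigit(peek())) {
            result = "concat(" + result + "," + star() + ")";
        }
        return result;
    }

    // RStar: RUnit ( "*" )*
    private String star() {
        String result = unit();
        while (peek() == '*') {
            pos++;
            result = "closure(" + result + ")";
        }
        return result;
    }

    // RUnit: "(" RChoice ")" | SYMBOL   (the EPSILON alternative is omitted here)
    private String unit() {
        char c = peek();
        if (c == '(') {
            pos++;
            String result = choice();
            if (peek() != ')') throw new IllegalArgumentException("expected ')' at " + pos);
            pos++;
            return result;
        }
        if (Character.isLetterOrDigit(c)) {
            pos++;
            return String.valueOf(c);
        }
        throw new IllegalArgumentException("unexpected input at " + pos);
    }

    public static void main(String[] args) {
        // prints concat(concat(closure(choice(a,b)),a),b)
        System.out.println(parse("(a|b)*ab"));
    }
}
```

Each method decides what to do by looking at one upcoming character, which is the hand-written analogue of the one-token lookahead JavaCC uses by default.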

22 Format of a JavaCC input Grammar
javacc_input ::= javacc_options
                 PARSER_BEGIN ( <IDENTIFIER>1 )
                 java_compilation_unit
                 PARSER_END ( <IDENTIFIER>2 )
                 ( production )*
                 <EOF>
Color usage:
  blue: nonterminal
  <orange>: a token type
  purple: token lexeme (reserved word, i.e., consisting of the literal itself)
  black: meta symbols

23 Notes
<IDENTIFIER> means any Java identifier, like var, class2, …
  IDENTIFIER means IDENTIFIER only.
<IDENTIFIER>1 must equal <IDENTIFIER>2.
java_compilation_unit is any Java code that as a whole can legally appear in a file.
  It must contain a main class declaration with the same name as <IDENTIFIER>1.
Ex:
  PARSER_BEGIN ( MyParser )
  package mypackage;
  import myotherpackage….;
  public class MyParser { … }
  class MyOtherUsefulClass { … }
  …
  PARSER_END (MyParser)

24 The input and output of javacc
Input (MyLangSpec.jj):
  PARSER_BEGIN ( MyParser )
  package mypackage;
  import myotherpackage….;
  public class MyParser { … }
  class MyOtherUsefulClass { … }
  …
  PARSER_END (MyParser)
Output:
  Token.java
  ParseError.java
  MyParser.java
  MyParserTokenManager.java
  MyParserConstants.java

25 Notes
Token.java and ParseError.java are the same for all inputs and can be reused.
The package declaration in *.jj is copied to all 3 outputs.
Import declarations in *.jj are copied to the parser and token manager files.
The parser file is given the file name <IDENTIFIER>1.java.
The parser file has the contents:
  … class MyParser { …
    // generated parser is inserted here.
  … }
The generated token manager provides one public method:
  Token getNextToken() throws ParseError;

26 Lexical Specification with JavaCC

27 javacc options
javacc_options ::= [ options { ( option_binding )* } ]
An option_binding has the form
  <IDENTIFIER>3 = <java_literal> ;
where <IDENTIFIER>3 is not case-sensitive.
Ex:
  options {
    USER_TOKEN_MANAGER=true;
    BUILD_TOKEN_MANAGER=false;
    OUTPUT_DIRECTORY="./sax2jcc/personnel";
    STATIC=false;
  }

28 More Options
LOOKAHEAD                 java_integer_literal (1)
CHOICE_AMBIGUITY_CHECK    java_integer_literal (2)    for A | B … | C
OTHER_AMBIGUITY_CHECK     java_integer_literal (1)    for (A)*, (A)+ and (A)?
STATIC (true)
DEBUG_PARSER (false)
DEBUG_LOOKAHEAD (false)
DEBUG_TOKEN_MANAGER (false)
OPTIMIZE_TOKEN_MANAGER    java_boolean_literal (false)
OUTPUT_DIRECTORY (current directory)
ERROR_REPORTING (true)

29 More Options
JAVA_UNICODE_ESCAPE (false)
  replace \u2245 with the actual Unicode character (6 chars → 1 char)
UNICODE_INPUT (false)
  input stream is in Unicode form
IGNORE_CASE (false)
USER_TOKEN_MANAGER (false)
  generate the TokenManager interface for the user's own scanner
USER_CHAR_STREAM (false)
  generate the CharStream.java interface for the user's own input stream
BUILD_PARSER (true)    java_boolean_literal
BUILD_TOKEN_MANAGER (true)
SANITY_CHECK (true)
FORCE_LA_CHECK (false)
COMMON_TOKEN_ACTION (false)
  invoke void CommonTokenAction(Token t) after every getNextToken()
CACHE_TOKENS (false)

30 Example: Figure 2.2
Regular expression                        Token          JavaCC notation
if                                        IF             "if" or "i" "f" or ["i"]["f"]
[a-z][a-z0-9]*                            ID             ["a"-"z"](["a"-"z","0"-"9"])*
[0-9]+                                    NUM            (["0"-"9"])+
([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+)     REAL           (["0"-"9"])+ "." (["0"-"9"])* | (["0"-"9"])* "." (["0"-"9"])+
("--"[a-z]*"\n") | (" "|"\n"|"\t")        nonToken, WS
.                                         error

31 JavaCC spec for the tokens from Fig 2.2
PARSER_BEGIN(MyParser)
class MyParser {}
PARSER_END(MyParser)

/* For the regular expression on the right, the token on the left will be returned */
TOKEN : {
    < IF: "if" >
  | < #DIGIT: ["0"-"9"] >
  | < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT> )* >
  | < NUM: (<DIGIT>)+ >
  | < REAL: ( (<DIGIT>)+ "." (<DIGIT>)* ) | ( (<DIGIT>)* "." (<DIGIT>)+ ) >
}

32 JavaCC spec for the tokens from Fig 2.2 (continued)
/* The regular expressions here will be skipped during lexical analysis */
SKIP : { < " " > | < "\t" > | < "\n" > }

/* like SKIP, but the skipped text is accessible from parser actions */
SPECIAL_TOKEN : {
  < "--" (["a"-"z"])* ( "\n" | "\r" | "\n\r" ) >
}

/* . For any substring not matching the lexical spec, javacc will throw an error */

/* main rule */
void start() : {}
{ ( <IF> | <ID> | <NUM> | <REAL> )* }
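When several token definitions match at the same position, JavaCC prefers the longest match and, on a tie, the definition that appears first in the grammar file. The sketch below imitates that rule for the Fig 2.2 tokens using java.util.regex. It is only an illustration under stated assumptions: the class LongestMatch and its "KIND:image" output format are hypothetical, the comment token is left out, and a generated token manager works quite differently internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the matching rule, not generated JavaCC code.
public class LongestMatch {
    // Rules in declaration order; earlier rules win ties.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ID", Pattern.compile("[a-z][a-z0-9]*"));
        RULES.put("NUM", Pattern.compile("[0-9]+"));
        RULES.put("REAL", Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+"));
        RULES.put("SKIP", Pattern.compile("[ \\t\\n]"));
    }

    public static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestKind = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestKind = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("lexical error at position " + pos);
            }
            if (!bestKind.equals("SKIP")) {
                out.add(bestKind + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [IF:if, ID:if8, NUM:12, REAL:1.5]
        System.out.println(tokenize("if if8 12 1.5"));
    }
}
```

Note how "if" becomes IF because IF is declared before ID, while "if8" becomes ID because the ID match is longer.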

34 Grammar Specification with JavaCC

35 The Form of a Production
java_return_type java_identifier ( java_parameter_list ) :
java_block
{ expansion_choices }
EX:
  void XMLDocument(Logger logger) :
  { int msg = 0; }
  {
      <StartDoc> { print(token); }
      Element(logger)
      <EndDoc> { print(token); }
    | else()
  }

36 Example (Grammar 3.30)
1. P → L
2. S → id := id
3. S → while id do S
4. S → begin L end
5. S → if id then S
6. S → if id then S else S
7. L → S
8. L → L ; S
Rules 1, 7, 8 combine into: P → S ( ; S )*

37 JavaCC Version of Grammar 3.30
PARSER_BEGIN(MyParser)
public class MyParser {}
PARSER_END(MyParser)

SKIP : { " " | "\t" | "\n" }

TOKEN : {
    <WHILE: "while"> | <BEGIN: "begin"> | <END: "end"> | <DO: "do">
  | <IF: "if"> | <THEN: "then"> | <ELSE: "else">
  | <SEMI: ";"> | <ASSIGN: "=">
  | <#LETTER: ["a"-"z"]>
  | <ID: <LETTER> ( <LETTER> | ["0"-"9"] )* >
}

38 JavaCC Version of Grammar 3.30 (cont'd)
void Prog() : {}
{ StmList() <EOF> }

void StmList() : {}
{ Stm() ( ";" Stm() )* }

void Stm() : {}
{
    <ID> "=" <ID>
  | "while" <ID> "do" Stm()
  | <BEGIN> StmList() <END>
  | "if" <ID> "then" Stm() [ LOOKAHEAD(1) "else" Stm() ]
}
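To make the top-down reading of these productions concrete, here is a hand-written recursive-descent recognizer with one method per nonterminal. It is a sketch under assumptions, not JavaCC output: the class name StmSketch is hypothetical, tokens are simply whitespace-separated words (so ";" must be surrounded by blanks), and errors are signalled with IllegalStateException rather than a generated exception class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical hand-written counterpart of Prog/StmList/Stm; not JavaCC output.
public class StmSketch {
    private static final List<String> KEYWORDS =
            Arrays.asList("while", "begin", "end", "do", "if", "then", "else");

    private final String[] toks;
    private int pos = 0;

    private StmSketch(String input) {
        String s = input.trim();
        toks = s.isEmpty() ? new String[0] : s.split("\\s+");
    }

    // Returns true if the whitespace-separated token sequence is a valid program.
    public static boolean accepts(String input) {
        StmSketch p = new StmSketch(input);
        try {
            p.stmList();
            return p.pos == p.toks.length;   // plays the role of <EOF>
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // One token of lookahead; "<EOF>" stands for the end of input.
    private String peek() { return pos < toks.length ? toks[pos] : "<EOF>"; }

    private void expect(String t) {
        if (!peek().equals(t)) {
            throw new IllegalStateException("expected " + t + " but saw " + peek());
        }
        pos++;
    }

    private void expectId() {
        String t = peek();
        if (!t.matches("[a-z][a-z0-9]*") || KEYWORDS.contains(t)) {
            throw new IllegalStateException("expected ID but saw " + t);
        }
        pos++;
    }

    // StmList: Stm ( ";" Stm )*
    private void stmList() {
        stm();
        while (peek().equals(";")) { pos++; stm(); }
    }

    // Stm: the next token alone selects the alternative.
    private void stm() {
        switch (peek()) {
            case "while": pos++; expectId(); expect("do"); stm(); break;
            case "begin": pos++; stmList(); expect("end"); break;
            case "if":
                pos++; expectId(); expect("then"); stm();
                if (peek().equals("else")) { pos++; stm(); }  // else binds to the nearest if
                break;
            default: expectId(); expect("="); expectId();
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("if a then if b then x = y else x = z"));  // prints true
        System.out.println(accepts("while a do"));                            // prints false
    }
}
```

The optional else branch is taken whenever the next token is "else", which corresponds to the LOOKAHEAD(1) annotation on the dangling-else alternative.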

39 Types of productions
production ::= javacode_production
             | regular_expr_production
             | bnf_production
             | token_manager_decls
Note:
  1 and 3 are used to define the grammar.
  2 is used to define tokens.
  4 is used to embed code into the token manager.

40 JAVACODE production
javacode_production ::= "JAVACODE"
                        java_return_type java_id "(" java_param_list ")"
                        java_block
Note: used to define nonterminals for recognizing something that is hard to parse using normal productions.

41 Example JAVACODE
JAVACODE
void skip_to_matching_brace() {
  Token tok;
  int nesting = 1;
  while (true) {
    tok = getToken(1);
    if (tok.kind == LBRACE) nesting++;
    if (tok.kind == RBRACE) {
      nesting--;
      if (nesting == 0) break;
    }
    tok = getNextToken();
  }
}

42 Note
Do not use a nonterminal defined by JAVACODE at a choice point without giving a LOOKAHEAD.
Problematic:
  void NT() : {}
  {
      skip_to_matching_brace()
    | some_other_production()
  }
OK:
  void NT() : {}
  {
      "{" skip_to_matching_brace()
    | "(" parameter_list() ")"
  }

44 TOKEN_MANAGER_DECLS
token_manager_decls ::= TOKEN_MGR_DECLS : java_block
The token manager declarations start with the reserved word "TOKEN_MGR_DECLS", followed by a ":" and then a set of Java declarations and statements (the Java block).
These declarations and statements are written into the generated token manager (MyParserTokenManager.java) and are accessible from within lexical actions.
There can only be one token manager declaration in a JavaCC grammar file.

45 regular_expression_production
regular_expr_production ::= [ lexical_state_list ]
                            regexpr_kind [ [ IGNORE_CASE ] ] :
                            { regexpr_spec ( | regexpr_spec )* }
regexpr_kind ::= TOKEN | SPECIAL_TOKEN | SKIP | MORE
TOKEN is used to define normal tokens.
SKIP is used to define skipped tokens (not passed on to the parser).
MORE is used to define semi-tokens (i.e., only part of a token).
SPECIAL_TOKEN lies between TOKEN and SKIP: it is passed on to the parser and is accessible to parser actions, but is ignored by production rules (not counted as a token). Useful for representing comments.

46 lexical_state_list
lexical_state_list ::= < * >
                     | < java_identifier ( , java_identifier )* >
The lexical state list describes the set of lexical states for which the corresponding regular expression production applies.
If it is written as "<*>", the regular expression production applies to all lexical states. Otherwise, it applies to all the lexical states in the identifier list within the angular brackets.
If omitted, the DEFAULT lexical state is assumed.

47 regexpr_spec
regexpr_spec ::= regular_expression1 [ java_block ] [ : java_identifier ]
Meaning: when regular_expression1 is matched,
  if java_block exists, execute it;
  if java_identifier appears, transition to that lexical state.

48 regular_expression
regular_expression ::= java_string_literal
                     | < [ [#] java_identifier : ] complex_regular_expression_choices >
                     | <java_identifier>
                     | <EOF>
(1) java_string_literal is matched only by the string it denotes; used for unnamed tokens.
(2) defines a labeled regular expression. If # occurs, the label is only usable inside other regular expressions and is not visible as a token to the parser.
(3) <java_identifier> is a reference to another labeled regular_expression; used in bnf_production.
(4) <EOF> is matched by end-of-file only.

49 Example
<DEFAULT, LEX_ST2> TOKEN IGNORE_CASE : {
    < FLOATING_POINT_LITERAL:
          (["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])?
        | "." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])?
        | (["0"-"9"])+ <EXPONENT> (["f","F","d","D"])?
        | (["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"]
    > { /* do something */ } : LEX_ST1
  | < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ >
}
Note: if # is omitted, E123 will be recognized erroneously as a token of kind EXPONENT.

50 Structure of complex_regular_expression
complex_regular_expression_choices ::= complex_regular_expression ( | complex_regular_expression )*
complex_regular_expression ::= ( complex_regular_expression_unit )*
complex_regular_expression_unit ::= java_string_literal
                                  | < java_identifier >
                                  | character_list
                                  | ( complex_regular_expression_choices ) [+|*|?]
Note:
  unit → complex_regular_expression, by concatenation (juxtaposition)
  complex_regular_expression → complex_regular_expression_choices, by choice (|)
  complex_regular_expression_choices → unit, by ( . ) [+|*|?]

51 character_list
character_list ::= [~] [ [ character_descriptor ( , character_descriptor )* ] ]
character_descriptor ::= java_string_literal [ - java_string_literal ]
java_string_literal ::= // reference to java grammar
                        " singleCharString* "
Note: java_string_literal here is restricted to length 1.
Ex:
  ~["a","b"]   all chars but a and b.
  ["a"-"f", "0"-"9", "A","B","C","D","E","F"]   hexadecimal digit.
  ["a","b"]+ is not a regular_expression_unit. Why? It should be written ( ["a","b"] )+ instead.

52 bnf_production
bnf_production ::= java_return_type java_identifier "(" java_parameter_list ")" ":"
                   java_block
                   "{" expansion_choices "}"
expansion_choices ::= expansion ( "|" expansion )*
expansion ::= ( expansion_unit )*

53 expansion_unit
expansion_unit ::= local_lookahead
                 | java_block
                 | "(" expansion_choices ")" [ "+" | "*" | "?" ]
                 | "[" expansion_choices "]"
                 | [ java_assignment_lhs "=" ] regular_expression
                 | [ java_assignment_lhs "=" ] java_identifier "(" java_expression_list ")"
Notes:
  1 is for lookahead
  2 is for semantic action
  4 = ( … )?
  5 is for token match
  6 is for match of another nonterminal

54 lookahead
local_lookahead ::= "LOOKAHEAD" "(" [ java_integer_literal ] [ "," ] [ expansion_choices ] [ "," ] [ "{" java_expression "}" ] ")"
Notes:
  3 components: max # of lookahead tokens + syntax + semantics
  Examples:
    LOOKAHEAD(3)
    LOOKAHEAD(5, Expr() <INT> | <REAL>, { true })
  For more on LOOKAHEAD, see the minitutorial.

55 JavaCC API
Non-terminals in the input grammar
  If NT is a nonterminal, then
    returntype NT(parameters) throws ParseError;
  is generated in the parser class.
API for parser actions
  Token token;
    This variable always holds the last token and can be used in parser actions. It is exactly the same as the token returned by getToken(0).
  Two other methods, getToken(int i) and getNextToken(), can also be used in actions to traverse the token list.

56 Token class
public int kind;    // 0 for <EOF>
public int beginLine, beginColumn, endLine, endColumn;
public String image;
public Token next;
public Token specialToken;
public String toString() { return image; }
public static final Token newToken(int ofKind)

57 Error reporting and recovery
It is not user friendly to throw an exception and exit parsing as soon as a syntax error is encountered.
Two exceptions
  ParseException → can be recovered from
  TokenMgrError → not expected to be recovered from
Error reporting
  Modify ParseException.java or TokenMgrError.java.
  The generateParseException method can always be invoked in a parser action to report an error.

58 Error Recovery in JavaCC
Shallow error recovery
Deep error recovery
Ex:
  void Stm() : {}
  { IfStm() | WhileStm() }
  If getToken(1) is neither "if" nor "while" => shallow error

59 Shallow recovery
A shallow error can be recovered from by adding one more choice:
  void Stm() : {}
  { IfStm() | WhileStm() | error_skipto(SEMICOLON) }
where
  JAVACODE
  void error_skipto(int kind) {
    ParseException e = generateParseException();  // generate the exception object
    System.out.println(e.toString());             // print the error message
    Token t;
    do {
      t = getNextToken();
    } while (t.kind != kind);
  }
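The skipping loop in error_skipto can be tried out without a generated parser. The sketch below simulates it on a plain list of token strings. Everything here is a hypothetical stand-in: the class SkipToSketch, the use of strings instead of Token objects, and the "<EOF>" marker. The extra end-of-input check is an addition of this sketch, included because the loop as written on the slide would not terminate if the synchronizing token never appears.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of error_skipto on a plain token list; not JavaCC output.
public class SkipToSketch {
    private final List<String> tokens;
    private int pos = 0;

    public SkipToSketch(List<String> tokens) { this.tokens = tokens; }

    // Mimics getNextToken(): consumes and returns the next token, or "<EOF>" at the end.
    private String getNextToken() {
        return pos < tokens.size() ? tokens.get(pos++) : "<EOF>";
    }

    // Consumes tokens up to and including the first token equal to kind.
    // The <EOF> check is an addition of this sketch: it stops the loop when
    // the input ends before the synchronizing token is found.
    public void errorSkipTo(String kind) {
        String t;
        do {
            t = getNextToken();
        } while (!t.equals(kind) && !t.equals("<EOF>"));
    }

    // Number of tokens consumed so far.
    public int position() { return pos; }

    public static void main(String[] args) {
        SkipToSketch p = new SkipToSketch(Arrays.asList("foo", "bar", ";", "while", "x"));
        p.errorSkipTo(";");
        System.out.println(p.position());  // prints 3
    }
}
```

After the call, parsing can resume at the token following the semicolon, here "while".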

60 Deep Error Recovery
Same example:
  void Stm() : {}
  { IfStm() | WhileStm() }
But this time the error occurs during parsing inside IfStm() or WhileStm(), instead of at the lookahead entry.
The approach: use the Java try-catch construct.
  void Stm() : {}
  {
    try {
      ( IfStm() | WhileStm() )
    } catch (ParseException e) {
      error_skipto(SEMICOLON);
    }
  }
Note: this is the new syntax for a javacc bnf_production.

