A PEG-based pattern matching library EXtended by back reference with regex-like notation in Scala Kota Mizushima Graduate School of Systems and Information.


1 A PEG-based pattern matching library EXtended by back reference with regex-like notation in Scala Kota Mizushima Graduate School of Systems and Information Engineering, University of Tsukuba

2 About myself Name: Kota Mizushima Country: Japan Ph.D. student at the University of Tsukuba Research: Parsing algorithms (especially Packrat Parsing) Packrat Parsing will be supported in Scala 2.8 (scala.util.parsing.combinator.PackratParsers) I'm interested in programming languages Currently developing my own programming language, Onion: object-oriented and statically typed

3 Wish Please speak slowly when asking questions about my presentation Because I'm not good at English, I can't understand a question correctly if you speak fast

4 Scala in Japan The number of people interested in Scala is increasing rapidly The reasons are: Widely known web services started to use Scala Twitter Foursquare Some advanced Java programmers started to use Scala e.g. @ymnk (a committer of Lift), @yuroyoro, @keisuke_n It is expected that functional languages can solve multi-core problems

5 Scala Books in Japan Currently, four books have been published in Japan はじめての Scala (it means "Scala for beginners") やさしい Scala 入門 (it means "an easy introduction to Scala". Unfortunately, it is not a good book) Scala スケーラブルプログラミング (Japanese translation of the book "Programming in Scala") Scala プログラミング入門 (Japanese translation of the book "Beginning Scala")

6 Why PEGEX? - Scala's (PEG) Parser Combinator vs. Regex - Parser Combinator (scala.util.parsing.combinator) Pros. powerful & extensible Cons. more verbose than Regex e.g. "abc" instead of abc, e1 ~ e2 instead of e1 e2 Regex (scala.util.matching) Pros. brevity Cons. not powerful & not extensible cannot handle recursive structures such as (), (()), ...

7 What is PEGEX? Wanted something that has both the power of PEG parser combinators and the conciseness of Regex Wanted back references as in Regex useful for handling cases such as matching corresponding XML tags PEGEX is an abbreviation of the following: PEG-based pattern matching library EXtended by backreference with regex-like notation named by @kinaba http://twitter.com/kinaba/status/8714614395 (in Japanese)

8 Syntax of PEGEX (1) (Name = e ;)+ Repetition of rules Name is the name of a nonterminal e (expression) consists of the following: a: character (e.g. x) [...]: character class (e.g. [a-z]) $: end of input .: any character e*: zero or more repetitions (e.g. a*) e+: one or more repetitions (e.g. a+) e?: zero or one (e.g. a?)

9 Syntax of PEGEX (2) e1 e2: sequence (e.g. ab) e1|e2: ordered choice (e.g. a|b) &e: and-predicate (e.g. &(a|b) b) !e: not-predicate (e.g. !a.) Predicates don't consume input #(Name): reference to a nonterminal Name #(Tag:e): assign name "Tag" to the parsing result of e ##(Name): backreference (#(Tag))
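The operators on these two syntax slides can be modelled as plain functions from an input position to an optional end position. The following is only a minimal sketch in plain Scala; the names (PegSketch, P, chr, seq, alt, and so on) are hypothetical and are not part of the PEGEX API.

```scala
object PegSketch {
  // A matcher takes the input and a start position and returns the end
  // position on success, or None on failure.
  type P = (String, Int) => Option[Int]

  // a: a single character
  def chr(c: Char): P =
    (s, i) => if (i < s.length && s.charAt(i) == c) Some(i + 1) else None

  // .: any character
  val any: P = (s, i) => if (i < s.length) Some(i + 1) else None

  // e1 e2: sequence
  def seq(a: P, b: P): P = (s, i) => a(s, i).flatMap(j => b(s, j))

  // e1|e2: ordered choice, b is tried only if a fails
  def alt(a: P, b: P): P = (s, i) => a(s, i).orElse(b(s, i))

  // &e and !e: predicates succeed or fail without consuming input
  def and(a: P): P = (s, i) => a(s, i).map(_ => i)
  def not(a: P): P = (s, i) => if (a(s, i).isEmpty) Some(i) else None

  // e*: repeat as long as e succeeds and makes progress; never fails
  def star(a: P): P = (s, i) => {
    var pos = i
    var next = a(s, pos)
    while (next.exists(_ > pos)) { pos = next.get; next = a(s, pos) }
    Some(pos)
  }
}
```

With these definitions, the slide's example !a. corresponds to seq(not(chr('a')), any), which accepts any single character except a.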

10 Syntax of PEGEX in PEGEX PEGEX=#(S)(#(Name)#(Eq)#(Expression)\;)+; Name=[a-zA-Z_][a-zA-Z_0-9]+#(S); Eq==#(S); Expression=#(Sequence)(#(BAR)#(Sequence))*; Sequence=#(Prefix)+; Prefix=#(Primary)(#(QUESTION)#(Primary)|...)*;...

11 Basic Usage: Identifier val ident: Pegex = """ L=#(IdentStart)#(IdentRest)*$; IdentStart=[a-zA-Z_]; IdentRest=#(IdentStart)|[0-9]; """.e // Represents identifier. invocation of method e makes an instance of Pegex. println(ident.matches("HogeFooBar")) // Some(HogeFooBar) println(ident.matches("Hoge_Foo_Bar")) //Some(Hoge_Foo_Bar) println(ident.matches("Hoge10")) //Some(Hoge10) println(ident.matches("10Hoge")) //None

12 Example: Nested Comment Parser Combinator lazy val C: Parser[Any] = "/*" ~ (C | not("*/") ~ ".".r).* ~ "*/"; PEGEX C=/\*(#(C)|!(\*/).)*\*/; Regex impossible (usually) PEGEX version is more terse than Parser Combinator version
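To make the recursion in the rule C=/\*(#(C)|!(\*/).)*\*/; concrete, here is a hand-written matcher in plain Scala that follows the same order of alternatives. It is only an illustrative sketch; the names NestedComment and comment are hypothetical, and this is not how PEGEX itself executes the rule.

```scala
object NestedComment {
  // Returns the index just past a (possibly nested) comment that starts
  // at position i, or None if no complete comment starts there.
  def comment(s: String, i: Int): Option[Int] =
    if (!s.startsWith("/*", i)) None
    else {
      var pos = i + 2
      var looping = true
      while (looping) {
        comment(s, pos) match {
          // First alternative: a nested comment, #(C)
          case Some(next) => pos = next
          case None =>
            // Second alternative: any character that does not start "*/"
            if (pos < s.length && !s.startsWith("*/", pos)) pos += 1
            else looping = false
        }
      }
      // The repetition is over; the closing delimiter must follow.
      if (s.startsWith("*/", pos)) Some(pos + 2) else None
    }
}
```

An input such as "/* a /* b */ c */" is accepted as a whole, while "/* a /* b */" is rejected because the outer comment is never closed.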

13 Example: XML-like Language Parser Combinator lazy val E: Parser[String] = (" "[a-z]+".r ") ~ E.* ~ (" "[a-z]+".r ") ^? { case t1 ~ _ ~ t2 if t1 == t2 => t1 } PEGEX E= #(E)* ; I=[a-z]+; Regex impossible (usually)
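The idea behind the backreference can be sketched in plain Scala: remember the name captured by the opening tag and require the same name in the closing tag. The sketch below assumes elements of the form <name>...</name> that contain only nested elements; the names TagMatch, Open and element are hypothetical and not taken from PEGEX.

```scala
object TagMatch {
  // Opening tag with the tag name captured in group 1.
  private val Open = "<([a-z]+)>".r

  // Returns the index just past one element starting at i, or None.
  def element(s: String, i: Int): Option[Int] =
    Open.findPrefixMatchOf(s.substring(i)).flatMap { m =>
      val tag = m.group(1)           // the value a backreference refers to
      var pos = i + m.end            // position after the opening tag
      var next = element(s, pos)     // zero or more child elements
      while (next.isDefined) { pos = next.get; next = element(s, pos) }
      val close = "</" + tag + ">"   // must repeat the captured name
      if (s.startsWith(close, pos)) Some(pos + close.length) else None
    }
}
```

Here "<a><b></b></a>" is accepted, while "<a><b></a></b>" is rejected because the closing names do not correspond to the opening ones.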

14 Implementation Written in Scala (About 1300 lines) PEGEX parser, which creates PEGEX AST AST to Parsing VM Instructions compiler Parsing VM Includes several implementations for experiments AST interpreter (greedy) AST interpreter (possessive) Parsing VM (greedy)

15 PEGEX parser About 300 lines PEGEX parser: 150 lines. PEG parser (notation is like normal PEG): 150 lines. Written using scala.util.parsing.combinator Pros. On the fly error checking (with IDE plugin) Cons. Error-reporting is poor Confusion by operator precedence e.g. ~> and <~ have different precedence

16 AST to Parsing VM Instructions compiler Simple and straightforward Only about 70 lines Pattern matching and first-class functions are excellent features No need to write boilerplate code for the Visitor pattern foldLeft and map simplify code
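As an illustration of why pattern matching keeps such a compiler short, the sketch below translates a tiny expression AST into a list of instructions in the style of a PEG parsing machine (choice and commit with relative offsets). The AST and instruction names are hypothetical and are not those used in the PEGEX source.

```scala
object CompileSketch {
  // A tiny PEG expression AST (hypothetical names).
  sealed trait Exp
  case class Chr(c: Char) extends Exp
  case object AnyChar extends Exp
  case class Seq2(a: Exp, b: Exp) extends Exp
  case class Alt(a: Exp, b: Exp) extends Exp
  case class Star(e: Exp) extends Exp
  case class Not(e: Exp) extends Exp

  // Instructions; offsets are relative to the instruction's own index.
  sealed trait Insn
  case class IChar(c: Char) extends Insn
  case object IAny extends Insn
  case class IChoice(offset: Int) extends Insn // push a backtrack entry
  case class ICommit(offset: Int) extends Insn // pop it and jump
  case object IFail extends Insn

  def compile(e: Exp): List[Insn] = e match {
    case Chr(c)     => List(IChar(c))
    case AnyChar    => List(IAny)
    case Seq2(a, b) => compile(a) ++ compile(b)
    case Alt(a, b) =>
      // Choice L1; a; Commit L2; L1: b; L2:
      val (ca, cb) = (compile(a), compile(b))
      (IChoice(ca.length + 2) :: ca) ++ (ICommit(cb.length + 1) :: cb)
    case Star(x) =>
      // L1: Choice L2; x; Commit L1; L2:
      val cx = compile(x)
      (IChoice(cx.length + 2) :: cx) ++ List(ICommit(-(cx.length + 1)))
    case Not(x) =>
      // Choice L1; x; Commit L2; L2: Fail; L1:
      val cx = compile(x)
      (IChoice(cx.length + 3) :: cx) ++ List(ICommit(1), IFail)
  }
}
```

Each case of the match plays the role of one visit method in a Visitor, without any of the accept/visit plumbing.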

17 Parsing VM About 220 lines based on Medeiros' "A parsing machine for PEGs" consists of: array of instructions input string pc (index into the array of instructions) cursor (index into the input string) stack of pc and cursor...
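The components listed on this slide can be put together in a very small interpreter loop. The following is a simplified sketch with a hypothetical instruction set (no rule calls or captures), not the PEGEX VM itself; jump offsets are relative to the current instruction.

```scala
object VmSketch {
  sealed trait Insn
  case class IChar(c: Char) extends Insn
  case object IAny extends Insn
  case class IChoice(offset: Int) extends Insn // push (pc + offset, cursor)
  case class ICommit(offset: Int) extends Insn // pop and jump by offset
  case object IFail extends Insn

  // Runs the program from the start of the input and returns the cursor
  // after a successful match, or None if the match fails.
  def run(program: Vector[Insn], input: String): Option[Int] = {
    var pc = 0                             // index into the instruction array
    var cursor = 0                         // index into the input string
    var stack = List.empty[(Int, Int)]     // backtrack entries: (pc, cursor)
    while (pc < program.length) {
      val ok = program(pc) match {
        case IChar(c) if cursor < input.length && input.charAt(cursor) == c =>
          cursor += 1; pc += 1; true
        case IAny if cursor < input.length =>
          cursor += 1; pc += 1; true
        case IChoice(offset) =>
          stack = (pc + offset, cursor) :: stack; pc += 1; true
        case ICommit(offset) =>
          stack = stack.tail; pc += offset; true
        case _ => false // IFail, or IChar/IAny that did not match
      }
      // On failure, resume from the most recent backtrack entry.
      if (!ok) stack match {
        case (savedPc, savedCursor) :: rest =>
          pc = savedPc; cursor = savedCursor; stack = rest
        case Nil => return None
      }
    }
    Some(cursor)
  }
}
```

For example, the program Vector(IChoice(3), IChar('a'), ICommit(-2), IChar('b')) encodes a*b: the choice entry lets the loop over a fall through to b as soon as a no longer matches.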

18 Implementation Issue: Packrat or not Packrat parsing: a parsing technique presented by Ford (2002) Pros. Guarantees linear time parsing by memoization Cons. Memory consumption is large: O(n), where n is the size of the input Execution overhead Currently the memoization code is removed from PEGEX Execution overhead was large in my interpreter
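The trade-off can be seen in a small sketch of memoization keyed by (rule, position). The grammar used here, A = P "+" A | P and P = "(" A ")" | "x", is a hypothetical example chosen because the second alternative of A re-parses P at the same position; the names PackratSketch, Memo and parse are not from PEGEX.

```scala
import scala.collection.mutable

object PackratSketch {
  type Result = Option[Int] // end position on success

  // Caches the result of each (rule, position) pair, so every pair is
  // evaluated at most once. The table is the O(n) memory cost.
  final class Memo {
    private val table = mutable.HashMap.empty[(String, Int), Result]
    var calls = 0 // number of rule bodies actually evaluated
    def apply(rule: String, pos: Int)(body: => Result): Result =
      table.get((rule, pos)) match {
        case Some(r) => r
        case None =>
          calls += 1
          val r = body
          table.update((rule, pos), r)
          r
      }
  }

  // Returns the parse result for A at position 0 and the number of
  // rule bodies that were evaluated.
  def parse(s: String): (Result, Int) = {
    val memo = new Memo
    // A = P "+" A | P
    def a(i: Int): Result = memo("A", i) {
      p(i).filter(j => s.startsWith("+", j)).flatMap(j => a(j + 1)).orElse(p(i))
    }
    // P = "(" A ")" | "x"
    def p(i: Int): Result = memo("P", i) {
      if (s.startsWith("(", i)) a(i + 1).filter(j => s.startsWith(")", j)).map(_ + 1)
      else if (s.startsWith("x", i)) Some(i + 1)
      else None
    }
    (a(0), memo.calls)
  }
}
```

Without the table, each level of parentheses would parse its contents twice, so the work doubles per nesting level; with it, the number of evaluated rule bodies is bounded by the number of rules times the number of positions.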

19 Implementation Issue: Possessive or Greedy PEG's operators (e*, e+, e1|e2) behave like possessive (not greedy) operators in Regex e.g. ("a".* ~ "a") in a PEG parser combinator doesn't match any input because "a".* consumes all of the input Users of Regex may be confused PEGEX supports a "greedy" flag that makes PEGEX operators behave like greedy operators in Regex new Pegex("a*a", likeRegex=true) However, this flag makes parsers slow
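The same difference can be observed with ordinary Java regular expressions, which offer both kinds of quantifier. This sketch uses only java.lang.String.matches; the object and value names are hypothetical.

```scala
object Quantifiers {
  // Greedy a*: first takes all three a's, then backtracks to give one
  // back so that the final a can match.
  val greedy: Boolean = "aaa".matches("a*a")

  // Possessive a*+: takes all three a's and never gives any back, so the
  // final a has nothing left to match. This is how PEG's e* behaves.
  val possessive: Boolean = "aaa".matches("a*+a")
}
```

The greedy pattern matches and the possessive one does not, which mirrors the behavior of ("a".* ~ "a") described on the slide.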

20 Current Status Currently, PEGEX still has rough edges The API will change frequently Documentation is not sufficient Source code is available on GitHub http://github.com/kmizu/pegex Feedback is welcome!

21 Future Prospects Better API for actual use Support handling semantic values by actions. val pegex = "L=[1-9][0-9]*|0;".e(v => v.toInt) println(pegex.parse("100").asInstanceOf[Int]) Support many flags in Regex such as (?i) Represents case insensitivity Speeding up Compiles PEGEX to Java bytecode

22 Conclusions Introduction to PEGEX Syntax Example Implementation Overview Future prospects of PEGEX Support handling semantic values Better API Support many flags in Regex Speeding up

23 Thanks for listening

