[ 1 ] May 11, 2010 C. Brabrand & J. G. Thomsen REGULAR EXPRESSIONS COPLAS DIKU, Denmark Pattern Matching on Strings using Regular Expressions Claus Brabrand.

1 Pattern Matching on Strings using Regular Expressions. Claus Brabrand, IT University of Copenhagen; Jakob G. Thomsen, Aarhus University. COPLAS, DIKU, Denmark, May 11, 2010. Examples: Num = 0 | [1-9][0-9]* ; <user = [a-z]+> "@" <domain = [a-z]+ ("." [a-z]+)*>

2 Abstract. We show how to achieve typed and unambiguous declarative pattern matching on strings using regular expressions extended with a simple recording operator. We give a characterization of ambiguity of regular expressions that leads to a sound and complete static analysis. The analysis is capable of pinpointing all ambiguities in terms of the structure of the regular expression and of reporting shortest ambiguous strings. We also show how pattern matching can be integrated into statically typed programming languages for deconstructing strings and reproducing typed and structured values. We validate our approach by giving a full implementation of it. The resulting tool, reg-exp-rec, adds typed and unambiguous pattern matching to Java in a stand-alone and non-intrusive manner. We evaluate the approach using several realistic examples.

3 Outline. Pattern Matching (intro & motiv): The Chomsky Hierarchy (1956); Regular Expressions: The Recording Construction; Ambiguity: Disambiguation; Type Inference; Usage and Examples; Evaluation and Conclusion.

4 Introduction & Motivation. Pattern matching is an indispensable problem: many applications need to "parse" dynamic input. 1) URLs: protocol, host, path, query-string (list of key-value pairs). 2) Log files: 13/02/ get /support.html; 20/02/ post /search.html. 3) DBLP: Three Models for the..., Noam Chomsky, 1956.

6 The Chomsky Hierarchy (1956). Language classes (+ formalisms): Type-3 regular expressions are "enough" for URLs, log files, DBLP, ... "Trade" (excess) expressivity for declarativity, simplicity, and static safety!

7 Type-0: java.net.URL. Turing-complete programming (e.g., Java) [ "unrestricted grammars" (e.g., rewriting systems) ]. Cyclomatic complexity (of official "java.net.URL"): 88 bug reports on Sun's Bug Repository! Bug reports span more than a decade!

8 Type-1: Context-Sensitivity. Not a widely used (or studied?) formalism. Presumably because it restricts expressivity without offering extra safety?

9 Type-2: Context-Free Grammars. Conceptually harder than regexps. Essentially (Type-3) regular expressions + recursion. The ultimate end-all scientific argument: We [image lost]d: regexps are 12 times more popular! (conjecture!)

10 Type-?: Regexp Capture Groups. Capturing groups (Perl, PHP, Java regex, ...): syntax (R) (i.e., in parentheses). Back-references: syntax \7 (i.e., "index of" capturing group). Beyond regularity!: (a*)b\1 is non-regular, since it denotes { a^n b a^n | n ≥ 0 }. In fact, not even context-free!!!: (.*).\1 is non-context-free, since it denotes { w c w | w ∈ Σ*, c ∈ Σ }.

11 Type-?: Regexp Capture Groups. Interpretation with back-tracking: NP-complete (exponential worst-case) :-( Regexp "a?^n a^n" vs. string "a^n": 1 minute vs. 0.02 msecs on strings of length 29!!!
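The blow-up on this slide can be tried out with a small sketch. It uses `java.util.regex`, a back-tracking engine; the class and method names (`BacktrackingDemo`, `pathological`, `timeMatch`) are made up for illustration, and the timings depend entirely on the machine and JVM, so none are quoted here.

```java
import java.util.regex.Pattern;

final class BacktrackingDemo {
    /** Builds the regexp a?^n a^n: n optional a's followed by n mandatory a's. */
    static Pattern pathological(int n) {
        return Pattern.compile("a?".repeat(n) + "a".repeat(n));
    }

    /** Matches the string a^n against a?^n a^n and returns the elapsed nanoseconds. */
    static long timeMatch(int n) {
        String input = "a".repeat(n);
        Pattern p = pathological(n);
        long start = System.nanoTime();
        boolean ok = p.matcher(input).matches();
        long elapsed = System.nanoTime() - start;
        if (!ok) {
            throw new IllegalStateException("a^n should match a?^n a^n");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Kept small on purpose: each step of 4 multiplies the search space by about 16.
        for (int n = 10; n <= 22; n += 4) {
            System.out.printf("n = %2d: %d microseconds%n", n, timeMatch(n) / 1_000);
        }
    }
}
```

The match always succeeds; only the time to find it grows, because the engine first lets every `a?` consume a character and then has to undo those choices one by one.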

12 Type-3: Regular Expressions. Closure properties: union, concatenation, iteration, restriction, intersection, complement, ... Decidability properties: ..., containment: L(R) ⊆ L(R'), ambiguity, ... Declarative! Safe! Simple!

14 Regular Expressions. Syntax and semantics: (definitions were figures and did not survive extraction), where: L1 · L2 is concatenation (i.e., { ω1 ω2 | ω1 ∈ L1, ω2 ∈ L2 }), and L* = ∪ (i ≥ 0) L^i, where L^0 = { ε } and L^i = L · L^(i-1).

15 Common Extensions (sugar). Any character (aka dot): "." as c1 | c2 | ... | cn, ci ∈ Σ. Character ranges: "[a-z]" as a | b | ... | z. One-or-more regexps: "R+" as R · R*. Optional regexp: "R?" as ε | R. Various repetitions; e.g.: "R{2,3}" as R · R · R?
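A quick way to convince oneself of these desugarings is to compare each sugared form with its expansion on a handful of strings. The sketch below does that with `java.util.regex`; it only samples a few inputs (it is not an equivalence proof), and `SugarDemo` and `agreeOn` are illustrative names.

```java
import java.util.List;
import java.util.regex.Pattern;

final class SugarDemo {
    /** True if both regexps accept exactly the same strings among the samples. */
    static boolean agreeOn(String sugared, String desugared, List<String> samples) {
        Pattern p = Pattern.compile(sugared);
        Pattern q = Pattern.compile(desugared);
        for (String s : samples) {
            if (p.matcher(s).matches() != q.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> samples = List.of("", "a", "b", "ab", "abab", "ababab", "abababab");
        System.out.println(agreeOn("(ab)+", "(ab)(ab)*", samples));          // R+  as R R*
        System.out.println(agreeOn("(ab)?", "|(ab)", samples));              // R?  as epsilon | R
        System.out.println(agreeOn("(ab){2,3}", "(ab)(ab)(ab)?", samples));  // R{2,3} as R R R?
        System.out.println(agreeOn("[a-c]", "a|b|c", List.of("", "a", "b", "c", "d")));
    }
}
```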

17 Recording. Syntax: <x = R>, where "x" is a recording identifier (it "remembers" the substring it matches). Semantics: (figure lost). Example (simplified emails): <user = [a-z]+> "@" <domain = [a-z]+ ("." [a-z]+)*>. Matching against string "obama@whitehouse.gov" yields: user = "obama" & domain = "whitehouse.gov". NB: cannot use DFAs / NFAs! They give only recognition (yes / no), not how a string matches (i.e., "the structure").
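For readers without the tool at hand, the closest stock-Java analogue of a recording is a named capture group. The following sketch uses `java.util.regex`, not reg-exp-rec, and assumes the simplified email shape is user, "@", domain, as the yielded values suggest; `EmailRecordingDemo` is an illustrative name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailRecordingDemo {
    // Named groups stand in for the recording identifiers "user" and "domain".
    static final Pattern EMAIL =
        Pattern.compile("(?<user>[a-z]+)@(?<domain>[a-z]+(?:\\.[a-z]+)*)");

    /** Returns {user, domain}, or null if s is not a (simplified) email address. */
    static String[] match(String s) {
        Matcher m = EMAIL.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("user"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] r = match("obama@whitehouse.gov");
        System.out.println("user = " + r[0]);
        System.out.println("domain = " + r[1]);
    }
}
```

Unlike the approach in the slides, this gives no static guarantees: group names are looked up as strings at run time, and nothing checks the pattern for ambiguity.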

18 Recording (structured). Another example (with nested recordings; the regexp itself did not survive extraction). Matching against string "26/06/1992" yields: date.day = 26, date.month = 06, date.year = 1992, date = 26/06/1992.

19 Recording (structured, lists). Yet another example (yielding lists), with R standing for the name pattern lost in extraction: ( <name = R> "\n" )* and <name = R> ( " & " <name = R> )*. Matching the latter against string "obama & bush" yields a list structure: name = [obama,bush].
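List-valued recordings are exactly where ordinary capture groups fall short: in `java.util.regex`, a group under a star keeps only its last iteration. The sketch below contrasts that with collecting every name by hand; it assumes names are `[a-z]+`, which the slide does not confirm, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListRecordingDemo {
    static final Pattern WHOLE = Pattern.compile("[a-z]+(?: & [a-z]+)*");
    static final Pattern NAME = Pattern.compile("[a-z]+");
    static final Pattern GROUPED = Pattern.compile("(?<first>[a-z]+)(?: & (?<rest>[a-z]+))*");

    /** All names in a string of the form name (" & " name)*, or null on mismatch. */
    static List<String> names(String s) {
        if (!WHOLE.matcher(s).matches()) {
            return null;
        }
        List<String> result = new ArrayList<>();
        Matcher m = NAME.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    /** What the starred group "rest" retains: only the final iteration. */
    static String lastRest(String s) {
        Matcher m = GROUPED.matcher(s);
        return m.matches() ? m.group("rest") : null;
    }

    public static void main(String[] args) {
        System.out.println(names("obama & bush"));
        System.out.println(lastRest("obama & bush & clinton"));
    }
}
```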

21 Abstract Syntax Trees (ASTs)

22 Ambiguity. Definition: R is ambiguous iff ∃ T, T' ∈ AST_R : T ≠ T' ∧ ||T|| = ||T'||, where || · || : AST → Σ* is the flattening (its defining equations were a figure and did not survive extraction).

23 Characterization of Ambiguity. Theorem: R unambiguous iff (the conditions were a figure and did not survive extraction). NB: sound & complete! Recall: R* = ε | R · R*.

24 Examples. Ambiguous: a|a, since L(a) ∩ L(a) = { a } ≠ Ø; a* · a*, since the overlap of L(a*) and L(a*) is { a^n } ≠ Ø. Unambiguous: a|aa, since L(a) ∩ L(aa) = Ø; a* · ba*, since the overlap of L(a*) and L(ba*) is Ø.

25 Ambiguity Examples. For a?b+|(ab)*: *** ambiguous choice: a?b+ | (ab)*, shortest ambiguous string: "ab". For (a|ab) · (ba|a): *** ambiguous concatenation: (a|ab) (ba|a), shortest ambiguous string: "aba". For (aa|aaa)*: *** ambiguous star: (aa|aaa)*, shortest ambiguous string: "aaaaa".
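The star example can be checked by brute force. The sketch below is not the static analysis from the talk; it simply counts, for each n, in how many ways a^n splits into consecutive blocks "aa" or "aaa", which is the number of distinct parses under (aa|aaa)*. `StarAmbiguityDemo` and its methods are illustrative names.

```java
final class StarAmbiguityDemo {
    /** Number of ways to cut a^n into blocks of length 2 or 3 (parses under (aa|aaa)*). */
    static long parses(int n) {
        long[] ways = new long[n + 1];
        ways[0] = 1; // the empty string has one parse: zero iterations
        for (int i = 1; i <= n; i++) {
            if (i >= 2) {
                ways[i] += ways[i - 2]; // last block is "aa"
            }
            if (i >= 3) {
                ways[i] += ways[i - 3]; // last block is "aaa"
            }
        }
        return ways[n];
    }

    /** Smallest n up to limit for which a^n has more than one parse, or -1 if none. */
    static int shortestAmbiguous(int limit) {
        for (int n = 0; n <= limit; n++) {
            if (parses(n) > 1) {
                return n;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 6; n++) {
            System.out.println("a^" + n + ": " + parses(n) + " parse(s)");
        }
        System.out.println("shortest ambiguous length: " + shortestAmbiguous(20));
    }
}
```

The first length with two parses is 5 (2+3 and 3+2), which agrees with the reported shortest ambiguous string "aaaaa".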

26 Ambiguity vs. Recordings. Ambiguities inside recordings: ... is not a problem! Contextual composition (of recordings): a* | a ... is a problem! (The recordings in both examples were lost in extraction.) Note: our tool tests only for these!

28 Disambiguation. 1) Manual rewriting: always possible :-), but tedious :-(, error-prone :-(, and not structure-preserving :-(. 2) Restriction: R1 - R2; and then encode R^C as Σ* - R, and R1 & R2 as (R1^C | R2^C)^C. 3) Disambiguators (from the characterization): concat: '·L', '·R'; choice: '|L', '|R'; star: '*L', '*R' (partial order on ASTs). 4) Default disambiguation: concat, choice, and star are all left-biased by default (our tool does this).
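The encodings in item 2 can be mirrored at the level of membership tests. The sketch below treats a language as a `Predicate<String>` backed by `java.util.regex` and derives complement and intersection from restriction alone; it illustrates the encoding, not how the tool implements it, and all names are made up.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class RestrictionDemo {
    /** Membership test for a regexp (the whole string must match). */
    static Predicate<String> lang(String regex) {
        Pattern p = Pattern.compile(regex, Pattern.DOTALL);
        return s -> p.matcher(s).matches();
    }

    /** R1 - R2: strings in R1 that are not in R2. */
    static Predicate<String> minus(Predicate<String> r1, Predicate<String> r2) {
        return s -> r1.test(s) && !r2.test(s);
    }

    /** R^C, encoded as Sigma* - R. */
    static Predicate<String> complement(Predicate<String> r) {
        return minus(lang(".*"), r);
    }

    /** R1 & R2, encoded as (R1^C | R2^C)^C. */
    static Predicate<String> and(Predicate<String> r1, Predicate<String> r2) {
        return complement(complement(r1).or(complement(r2)));
    }

    public static void main(String[] args) {
        Predicate<String> evenAs = and(lang("a*"), lang("(..)*"));
        System.out.println(evenAs.test("aaaa")); // in both languages
        System.out.println(evenAs.test("aaa"));  // odd length, rejected
        System.out.println(evenAs.test("ab"));   // not all a's, rejected
    }
}
```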

29 Quizzz (Restriction vs. Recording). Which can have recordings (i.e., where do recordings make sense?) in: R1 - R2; R3^C as Σ* - R3; R4 & R5 as (R4^C | R5^C)^C. A) R1, R2, R3, R4, and R5 can have recordings. B) R1, R3, R4, and R5 can have recordings. C) R1, R4, and R5 can have recordings. D) R1 can have recordings. E) None of them can have recordings.

31 Type Inference. Type inference: R : (L,S)

32 Examples (Type Inference). Regexp: Person = <name = ...> " (" <age = ...> ")" (patterns inside the recordings lost in extraction). Compile (our tool) yields: class Person { // auto-generated String name; int age; static Person match(String s) {... } public String toString() {... } } Usage: String s = "obama (48)"; Person p = Person.match(s); print(p.name + " is " + p.age + "y old");
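To make the generated interface concrete, here is a hand-written approximation with the same fields and `match` method. It is not the tool's output: it assumes names are `[a-z]+` and ages are `[0-9]+`, relies on `java.util.regex` internally, and uses the made-up class name `HandWrittenPerson`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HandWrittenPerson {
    final String name;
    final int age;

    // Assumed shape: name, space, age in parentheses, e.g. "obama (48)".
    private static final Pattern SHAPE =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    private HandWrittenPerson(String name, int age) {
        this.name = name;
        this.age = age;
    }

    /** Returns the parsed person, or null if s does not have the expected shape. */
    static HandWrittenPerson match(String s) {
        Matcher m = SHAPE.matcher(s);
        if (!m.matches()) {
            return null;
        }
        // Integer.parseInt throws NumberFormatException if the digits overflow an int.
        return new HandWrittenPerson(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        HandWrittenPerson p = match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old");
    }
}
```

The point of the slide is that this boilerplate, including the `int` type for the age field, is inferred from the regexp rather than written and kept in sync by hand.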

33 Examples (Type Inference). Regexp: Person = <name = ...> " (" <age = ...> ")" ; People = ( $Person "\n" )* ; Compile (our tool) yields: class People { // auto-generated String[] name; int[] age; static People match(String s) {... } public String toString() {... } } Usage: String s = "obama (48) \n bush (63) \n "; People p = People.match(s); println("Second name is " + p.name[1]);

34 Examples (Type Inference). Regexp: Person = <name = ...> " (" <age = ...> ")" ; People = ( <person = $Person> "\n" )* ; Compile (our tool) yields: class People { // auto-generated Person[] person; class Person { // nested class String name; int age; }... } Usage: String s = "obama (48) \n bush (63) \n "; People people = People.match(s); for (Person p : people.person) println(p.name);

36 URLs. Example: "http://www.google.com/search?q=record&hl=en" (protocol, host, path, query-string). Regexp (right-hand sides lost in extraction): Host = ; Path = ; Query = ; URL = "http://" $Host "/" $Path "?" $Query ; Query string further structured (list of key-value pairs): KeyVal = <key = ...> "=" <val = ...> ; Query = $KeyVal ("&" $KeyVal)* ;

37 URLs (Usage Example). Regexp (right-hand sides of Host and Path lost in extraction): Host = ; Path = ; KeyVal = <key = ...> "=" <val = ...> ; Query = $KeyVal ("&" $KeyVal)* ; URL = "http://" $Host "/" $Path "?" $Query ; Usage (example): String s = "http://www.google.com/search?q=record"; URL url = URL.match(s); print("Host is: " + url.host); if (url.key.length>0) print("1st key: " + url.key[0]); for (String val : url.val) println("value = " + val);
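The same extraction can be approximated in plain Java to see which values the usage code would observe. This is a sketch with `java.util.regex`, not the generated `URL` class: the character classes for host, path, keys and values are assumptions (the slide's definitions did not survive), and `UrlSketch` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlSketch {
    final String host;
    final String path;
    final List<String> key = new ArrayList<>();
    final List<String> val = new ArrayList<>();

    // Assumed sub-patterns: lowercase letters, dots and slashes only.
    private static final Pattern URL = Pattern.compile(
        "http://(?<host>[a-z.]+)/(?<path>[a-z./]*)\\?"
        + "(?<query>[a-z]+=[a-z]*(?:&[a-z]+=[a-z]*)*)");
    private static final Pattern KEYVAL = Pattern.compile("(?<key>[a-z]+)=(?<val>[a-z]*)");

    private UrlSketch(String host, String path) {
        this.host = host;
        this.path = path;
    }

    /** Returns the deconstructed URL, or null if s does not match. */
    static UrlSketch match(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) {
            return null;
        }
        UrlSketch u = new UrlSketch(m.group("host"), m.group("path"));
        // A starred group keeps only its last match, so the pairs are collected in a second pass.
        Matcher kv = KEYVAL.matcher(m.group("query"));
        while (kv.find()) {
            u.key.add(kv.group("key"));
            u.val.add(kv.group("val"));
        }
        return u;
    }

    public static void main(String[] args) {
        UrlSketch url = match("http://www.google.com/search?q=record&hl=en");
        System.out.println("Host is: " + url.host);
        System.out.println("keys = " + url.key + ", values = " + url.val);
    }
}
```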

38 Log Files. Format: 13/02/ /support.html; 20/02/ /search.html; ... Regexp (right-hand sides lost in extraction): Date = ; IP = ; Entry = ; Log = $Entry * ; Usage: Log log = Log.match(log_file); for (Entry e : log.entry) if (e.date.month == 02 && e.date.day == 29) print("Access on LEAP YEAR from IP# " + e.ip);

39 Log Files (cont'd, ambiguity). Assume we forgot "/" (between day & month). Regexp: Day = 0?[1-9] | [1-2][0-9] | 30 | 31 ; Month = 0?[1-9] | 10 | 11 | 12 ; Date = <date = <day = ...> // no slash ! <month = ...> "/" <year = ...>> ; Error: *** ambiguous concatenation: ... shortest ambiguous string: "101". Ambiguity: i.e., "1/01" (January 1) vs. "10/1" (January 10) :-)
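The reported string "101" can be reproduced by enumerating every cut point and checking both halves against the `Day` and `Month` patterns from the slide. This is a brute-force illustration for one concatenation, not the tool's analysis; `DateSplitDemo` and `splits` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class DateSplitDemo {
    static final Pattern DAY = Pattern.compile("0?[1-9]|[1-2][0-9]|30|31");
    static final Pattern MONTH = Pattern.compile("0?[1-9]|10|11|12");

    /** All ways to cut s into a Day prefix and a Month suffix, written as "day|month". */
    static List<String> splits(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i <= s.length(); i++) {
            String day = s.substring(0, i);
            String month = s.substring(i);
            if (DAY.matcher(day).matches() && MONTH.matcher(month).matches()) {
                result.add(day + "|" + month);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splits("101"));  // two readings, so Day Month is ambiguous here
        System.out.println(splits("0101")); // a single reading
    }
}
```

With the slash restored, the separator fixes the cut point, so each string has at most one reading.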

40 DBLP (Format). DBLP (XML) format (tags lost in extraction). Entry 1: Noam Chomsky; Three Models for the Description of Language; 1956; IRE Transactions on Information Theory. Entry 2: Claus Brabrand; Jakob G Thomsen; Typed and Unambiguous Pattern Matching on Strings using Regular Expressions; 2010; Submitted. ...

41 DBLP (Regexp). DBLP regexp (XML tags and recordings lost in extraction): Author = " " " " ; Title = " " " " ; Article = " " $Author* $Title .* " " ; DBLP = * ; Ambiguity!: EITHER 2 publications (.* = "") OR 1 publication (.* = gray part)!!! Error: *** ambiguous star: * shortest ambiguous string: "

42 DBLP (Disambiguated). DBLP regexp (XML tags and recordings lost in extraction): Author = " " " " ; Title = " " " " ; Article = " " $Author* $Title .* " " ; DBLP = * ; Disambiguated (using "(R1 - R2)"): Article = " " $Author* $Title (.* - (.* " " .*)) " " ; Unambiguous! :-)

43 DBLP (Usage Example). DBLP regexp (XML tags and recordings lost in extraction): Author = " " " " ; Title = " " " " ; Article = " " $Author* $Title .* " " ; DBLP = * ; Usage (example): DBLP dblp = DBLP.match(readXMLfile("DBLP.xml")); for (Article a : dblp.article) print("Title: " + a.title);

45 Evaluation. Evaluation summary (table lost in extraction; surviving labels: [ MatMult ] [ NP-Complete ] [ Frisch&Cardelli'04 ]). Also, (Type-3) regexps are expressive "enough" for: URLs, log files, DBLP, ...

46 Type-3 vs. Type-0 (URLs). Regexps vs. Java: regexps are 8 times more concise!

47 java.util.regex vs. Our approach. Efficiency (on DBLP): java.util.regex: exponential, O(2^|ω|): 2,500 chars in 2 mins! In contrast, ours: linear (on DBLP): 1,200,000 chars in 6 secs! (2 mins vs. 10 msecs)

48 Related Work. Recording (with lists in general): "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP. Ambiguity: [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed). Disambiguation: [Vansummeren'06], but with global, not local, disambiguation. Type inference: exact type inference in XDuce & CDuce (soundness + completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java).

49 Conclusion. For string pattern matching, it is possible to "trade (excess) expressivity for safety + simplicity", i.e., ambiguity checking and type inference, + stand-alone & non-intrusive language integration (Java)! In conclusion: if regular expressions are sufficiently expressive, they provide a simple, declarative, and safe means for pattern matching on strings, capable of extracting highly structured information in a statically type-safe and unambiguous manner.

50 Questions? Complaints?

