Pattern Matching on Strings using Regular Expressions


1 Pattern Matching on Strings using Regular Expressions
Num = 0 | [1-9][0-9]*
= [a-z]+ [a-z]+ ("." [a-z]+ )*
Claus Brabrand, IT University of Copenhagen
Jakob G. Thomsen, Aarhus University

2 Outline
Pattern Matching (intro & motiv): The Chomsky Hierarchy (1956)
Regular Expressions: The Recording Construction
Ambiguity: Disambiguation
Type Inference
Usage and Examples
Evaluation and Conclusion

3 Introduction & Motivation
Pattern matching is an indispensable problem: many applications need to "parse" dynamic input.
1) URLs: protocol, host, path, query-string (list of key-value pairs)
2) Log Files:
13/02/ get /support.html
20/02/ post /search.html
3) DBLP:
<article>
<title>Three Models for the...</title>
<author>Noam Chomsky</author>
<year>1956</year>
</article>


5 The Chomsky Hierarchy (1956)
Language classes (+ formalisms):
Type-3 regular expressions are "enough" for: URLs, log files, DBLP, ...
"Trade" (excess) expressivity for: declarativity, simplicity, and static safety!

6 Type-0: java.net.URL
Turing-complete programming (e.g., Java) [ "unrestricted grammars" (e.g., rewriting systems) ]
Cyclomatic complexity (of official "java.net.URL"):
88 bug reports on Sun's Bug Repository!
Bug reports span more than a decade!

7 Type-1: Context-Sensitivity
Not widely used (or studied?) formalism
Presumably because: it restricts expressivity w/o offering extra safety?

8 Type-2: Context-Free Grammars
Conceptually harder than regexps
Essentially (Type-3) Regular Expressions + recursion
The ultimate end-all scientific argument (conjecture!): regexps are 12 times more popular!

9 Type-?: Regexp Capture Groups
Capturing groups (Perl, PHP, Java regex, ...):
Syntax: (R) (i.e., in parentheses)
Back-references:
Syntax: \7 (i.e., "index of" capturing group)
Beyond regularity!: (a*)b\1 is non-regular: $\{ a^n b a^n \mid n \ge 0 \}$
In fact, not even context-free!!!: (.*).\1 is non-context-free: $\{ \omega c \omega \mid \omega \in \Sigma^*, c \in \Sigma \}$
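The back-reference behaviour described on this slide can be observed directly with java.util.regex. The following is only a small sketch; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // (a*)b\1 : the back-reference \1 must repeat exactly what group 1 captured,
    // so the pattern accepts a^n b a^n, which no plain regular expression can describe.
    static final Pattern COPY = Pattern.compile("(a*)b\\1");

    static boolean matches(String s) {
        return COPY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("aabaa")); // same number of a's on both sides
        System.out.println(matches("aaba"));  // different number of a's
    }
}
```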

10 Type-?: Regexp Capture Groups
Interpretation with back-tracking: NP-complete (exponential worst-case) :-(
regexp "$a?^n a^n$" vs. string "$a^n$":
1 minute vs. 0.02 msecs on strings of length 29!!!
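To try the pathological case on a back-tracking matcher, the pattern and input can be generated for any n. This is a hedged sketch with hypothetical names; it reports whatever time it measures and makes no claim about reproducing the slide's numbers.

```java
import java.util.regex.Pattern;

class PathologicalRegex {
    // Builds a?^n a^n, e.g. "a?a?a?aaa" for n = 3.
    static String pattern(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append("a?");
        for (int i = 0; i < n; i++) sb.append('a');
        return sb.toString();
    }

    // Builds the input a^n.
    static String input(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append('a');
        return sb.toString();
    }

    // Matches a^n against a?^n a^n and returns the elapsed time in nanoseconds.
    static long timeMatchNanos(int n) {
        Pattern p = Pattern.compile(pattern(n));
        String s = input(n);
        long start = System.nanoTime();
        boolean ok = p.matcher(s).matches();
        long elapsed = System.nanoTime() - start;
        if (!ok) throw new IllegalStateException("a^n should match a?^n a^n");
        return elapsed;
    }

    public static void main(String[] args) {
        // Keep n small: the running time of a back-tracking matcher can grow quickly.
        for (int n = 5; n <= 20; n += 5) {
            System.out.println("n = " + n + ": " + timeMatchNanos(n) + " ns");
        }
    }
}
```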

11 Type-3: Regular Expressions
Simple! Declarative! Safe!
Closure properties: Union, Concatenation, Iteration, Restriction, Intersection, Complement, ...
Decidability properties: ..., Containment: $L(R) \subseteq L(R')$, Ambiguity


13 Regular Expressions
Syntax:
Semantics:
where:
$L_1 \cdot L_2$ is concatenation (i.e., $\{ \omega_1 \omega_2 \mid \omega_1 \in L_1, \omega_2 \in L_2 \}$)
$L^* = \bigcup_{i \ge 0} L^i$ where $L^0 = \{ \epsilon \}$ and $L^i = L \cdot L^{i-1}$

14 Common Extensions (sugar)
Any character (aka, dot): "." as $c_1|c_2|...|c_n$, $c_i \in \Sigma$
Character ranges: "[a-z]" as a|b|...|z
One-or-more regexps: "R+" as RR*
Optional regexp: "R?" as $\epsilon$|R
Various repetitions; e.g.: "R{2,3}" as RRR?
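The desugarings above can be spot-checked with java.util.regex by comparing the sugared and desugared forms on a handful of sample strings. This is a sanity check under assumed names, not a proof of equivalence.

```java
import java.util.regex.Pattern;

class SugarDemo {
    // True if both patterns agree (whole-string match) on every sample.
    static boolean sameOn(String sugared, String desugared, String[] samples) {
        Pattern p = Pattern.compile(sugared);
        Pattern q = Pattern.compile(desugared);
        for (String s : samples) {
            if (p.matcher(s).matches() != q.matcher(s).matches()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = { "", "a", "aa", "aaa", "aaaa", "b", "c", "d", "ab" };
        System.out.println(sameOn("a+", "aa*", samples));       // R+ as RR*
        System.out.println(sameOn("a?", "(?:|a)", samples));    // R? as epsilon|R
        System.out.println(sameOn("a{2,3}", "aaa?", samples));  // R{2,3} as RRR?
        System.out.println(sameOn("[a-c]", "a|b|c", samples));  // range as choice
    }
}
```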


16 Recording
Syntax: "x" is a recording identifier (it "remembers" the substring it matches)
Semantics:
Example (simplified emails):
<user = [a-z]+ > <domain = [a-z]+ ("." [a-z]+)* >
Matching against string:
yields: user = "obama" & domain = "whitehouse.gov"
NB: cannot use DFAs / NFAs! They give only recognition (yes / no), not how (i.e., "the structure").

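For readers who know java.util.regex, the recording example on slide 16 can be approximated with named capturing groups. This is a sketch, not the authors' tool: the "@" separator and the sample address are assumptions, since neither survives in the transcript.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailRecording {
    // Named groups stand in for the recordings <user = ...> and <domain = ...>.
    // The "@" between them is an assumption (it is not visible in the transcript).
    private static final Pattern EMAIL =
        Pattern.compile("(?<user>[a-z]+)@(?<domain>[a-z]+(?:\\.[a-z]+)*)");

    // Returns {user, domain}, or null if the whole string does not match.
    static String[] match(String s) {
        Matcher m = EMAIL.matcher(s);
        if (!m.matches()) return null;
        return new String[] { m.group("user"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] r = match("obama@whitehouse.gov"); // assumed sample input
        System.out.println("user = " + r[0] + " & domain = " + r[1]);
    }
}
```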
17 Recording (structured)
Another example (with nested recordings):
<date = <day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} > >
Matching against string: "26/06/1992"
yields:
date = 26/06/1992
date.day = 26
date.month = 06
date.year = 1992
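Nested recordings correspond roughly to nested named groups in java.util.regex. The sketch below (hypothetical class name) flattens the structure into dotted keys to mirror the output shown on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRecording {
    // The outer group "date" contains the inner groups day, month, and year.
    private static final Pattern DATE = Pattern.compile(
        "(?<date>(?<day>[0-9]{2})/(?<month>[0-9]{2})/(?<year>[0-9]{4}))");

    // Returns the recorded substrings keyed by dotted path, or null on no match.
    static Map<String, String> match(String s) {
        Matcher m = DATE.matcher(s);
        if (!m.matches()) return null;
        Map<String, String> r = new LinkedHashMap<>();
        r.put("date", m.group("date"));
        r.put("date.day", m.group("day"));
        r.put("date.month", m.group("month"));
        r.put("date.year", m.group("year"));
        return r;
    }

    public static void main(String[] args) {
        System.out.println(match("26/06/1992"));
    }
}
```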

18 Recording (structured, lists)
Yet another example (yielding lists):
<name = [a-z]+ > " & " <name = [a-z]+ >
( <name = [a-z]+ > "\n" )*
<name = [a-z]+ > (" & " <name = [a-z]+ > )*
Matching against string: "obama & bush"
yields a list structure: name = [obama,bush]
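java.util.regex keeps only the last capture of a repeated group, so a list-valued recording has to be emulated. One possible sketch (names are illustrative) validates the whole string first and then collects every name with find().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameList {
    // <name = [a-z]+ > (" & " <name = [a-z]+ > )*
    private static final Pattern WHOLE = Pattern.compile("[a-z]+(?: & [a-z]+)*");
    private static final Pattern NAME = Pattern.compile("[a-z]+");

    // Returns all recorded names in order, or null if the string does not match.
    static List<String> match(String s) {
        if (!WHOLE.matcher(s).matches()) return null;
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(s);
        while (m.find()) names.add(m.group());
        return names;
    }

    public static void main(String[] args) {
        System.out.println("name = " + match("obama & bush"));
    }
}
```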


20 Abstract Syntax Trees (ASTs)

21 Ambiguity
Definition: R ambiguous iff
$\exists T, T' \in AST_R: T \ne T' \wedge \|T\| = \|T'\|$
where $\|\cdot\|: AST \to \Sigma^*$ (the flattening) is:

22 Characterization of Ambiguity
Theorem: R unambiguous iff
NB: sound & complete!
$R^* = \epsilon \mid RR^*$

23 Examples
Ambiguous:
a|a: $L(a) \cap L(a) = \{ a \} \ne \emptyset$
a*a*: $L(a^*) \bowtie L(a^*) = \{ a^n \} \ne \emptyset$
Unambiguous:
a|aa: $L(a) \cap L(aa) = \emptyset$
a*ba*: $L(a^*) \bowtie L(ba^*) = \emptyset$

24 Ambiguity Examples
a?b+|(ab)*
*** ambiguous choice: a?b+ <-|-> (ab)*
shortest ambiguous string: "ab"
(a|ab)(ba|a)
*** ambiguous concatenation: (a|ab) <--> (ba|a)
shortest ambiguous string: "aba"
(aa|aaa)*
*** ambiguous star: (aa|aaa)*
shortest ambiguous string: "aaaaa"
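The three shortest ambiguous strings reported above can be cross-checked by brute force. The sketch below is not the tool's syntax-directed algorithm: it simply enumerates strings over {a, b}, shortest first, and counts parses at the top-level operator, assuming the sub-expressions are themselves unambiguous. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AmbiguityBruteForce {
    // All strings over {a, b} of length 0..maxLen, shortest first.
    static List<String> strings(int maxLen) {
        List<String> all = new ArrayList<>();
        all.add("");
        int from = 0;
        for (int len = 1; len <= maxLen; len++) {
            int to = all.size();
            for (int i = from; i < to; i++) {
                all.add(all.get(i) + "a");
                all.add(all.get(i) + "b");
            }
            from = to;
        }
        return all;
    }

    static boolean in(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    // Choice R1|R2: a string in both languages can be parsed via either branch.
    static String shortestChoice(String r1, String r2, int maxLen) {
        for (String s : strings(maxLen)) {
            if (in(r1, s) && in(r2, s)) return s;
        }
        return null;
    }

    // Concatenation R1 R2: two different split points give two parses.
    static String shortestConcat(String r1, String r2, int maxLen) {
        for (String s : strings(maxLen)) {
            int splits = 0;
            for (int i = 0; i <= s.length(); i++) {
                if (in(r1, s.substring(0, i)) && in(r2, s.substring(i))) splits++;
            }
            if (splits > 1) return s;
        }
        return null;
    }

    // Star R*: count the ways to cut s into non-empty pieces that each lie in L(R).
    static String shortestStar(String r, int maxLen) {
        for (String s : strings(maxLen)) {
            int n = s.length();
            long[] ways = new long[n + 1];
            ways[0] = 1;
            for (int end = 1; end <= n; end++) {
                for (int start = 0; start < end; start++) {
                    if (ways[start] > 0 && in(r, s.substring(start, end))) {
                        ways[end] += ways[start];
                    }
                }
            }
            if (ways[n] > 1) return s;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(shortestChoice("a?b+", "(ab)*", 4));
        System.out.println(shortestConcat("a|ab", "ba|a", 4));
        System.out.println(shortestStar("aa|aaa", 5));
    }
}
```

Enumeration only finds witnesses up to the chosen length bound, so a null result does not show that an expression is unambiguous.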


26 Disambiguation
1) Manual rewriting: Always possible :-) Tedious :-( Error-prone :-( Not structure-preserving :-(
2) Restriction: R1 - R2
And then encode...: $R^C$ as: $\Sigma^* - R$; R1 & R2 as: $(R_1^C|R_2^C)^C$
3) Disambiguators: From characterization: concat: 'L', 'R'; choice: '|L', '|R'; star: '*L', '*R' (partial-order on ASTs)
4) Default disamb: concat, choice, and star are all left-biased (by default)! (Our tool does this)
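The encoding of complement and intersection in terms of restriction can be illustrated by treating a language as a membership predicate. This is only a semantic sketch with made-up names; it says nothing about how the tool represents languages internally.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class RestrictionEncoding {
    // A language is modelled as a membership test on strings.
    static Predicate<String> lang(String regex) {
        Pattern p = Pattern.compile(regex);
        return s -> p.matcher(s).matches();
    }

    // Restriction R1 - R2 is the primitive operation.
    static Predicate<String> minus(Predicate<String> r1, Predicate<String> r2) {
        return s -> r1.test(s) && !r2.test(s);
    }

    // R^C encoded as Sigma* - R.
    static Predicate<String> complement(Predicate<String> r) {
        Predicate<String> sigmaStar = s -> true;
        return minus(sigmaStar, r);
    }

    // R1 & R2 encoded as (R1^C | R2^C)^C (De Morgan).
    static Predicate<String> intersection(Predicate<String> r1, Predicate<String> r2) {
        return complement(complement(r1).or(complement(r2)));
    }

    public static void main(String[] args) {
        Predicate<String> both = intersection(lang("a*"), lang("(aa)*"));
        System.out.println(both.test("aa")); // in a* and in (aa)*
        System.out.println(both.test("a"));  // in a* but not in (aa)*
    }
}
```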


28 Type Inference Type Inference: R : (L,S)

29 Examples (Type Inference)
Regexp:
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
compile (our tool):
class Person { // auto-generated
  String name;
  int age;
  static Person match(String s) { ... }
  public String toString() { ... }
}
Usage:
String s = "obama (48)";
Person p = Person.match(s);
print(p.name + " is " + p.age + "y old");
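To make the shape of the generated class concrete, here is a hand-written approximation built on java.util.regex. It is not the tool's actual output; returning null on a failed match and using Integer.parseInt for the age are assumptions of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Person {
    // Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    private static final Pattern P =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    final String name;
    final int age;

    private Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Returns the typed record, or null if s does not match (assumed convention).
    // Very long digit strings would overflow int and throw NumberFormatException.
    static Person match(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) return null;
        return new Person(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        Person p = Person.match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old");
    }
}
```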

30 Examples (Type Inference)
Regexp:
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( $Person "\n" )*
compile (our tool):
class People { // auto-generated
  String[] name;
  int[] age;
  static People match(String s) { ... }
  public String toString() { ... }
}
Usage:
String s = "obama (48) \n bush (63) \n ";
People p = People.match(s);
println("Second name is " + p.name[1]);

31 Examples (Type Inference)
Regexp:
Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( <person = $Person > "\n" )* ;
compile (our tool):
class People { // auto-generated
  Person[] person;
  class Person { // nested class
    String name;
    int age;
  }
  ...
}
Usage:
String s = "obama (48) \n bush (63) \n ";
People people = People.match(s);
for (Person p : people.person) println(p.name);
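The nested, list-valued type can likewise be approximated by hand. In this sketch (again not the tool's output) a List replaces the array, the nested class is static, and the input has no spaces around the line breaks so that it matches the regexp literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class People {
    static class Person { // nested class, one per <person = $Person>
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    // One entry: $Person "\n"
    private static final String ENTRY = "([a-z]+) \\(([0-9]+)\\)\n";
    private static final Pattern WHOLE = Pattern.compile("(?:" + ENTRY + ")*");
    private static final Pattern ONE = Pattern.compile(ENTRY);

    final List<Person> person = new ArrayList<>();

    // Returns the list structure, or null if s is not a sequence of entries.
    static People match(String s) {
        if (!WHOLE.matcher(s).matches()) return null;
        People people = new People();
        Matcher m = ONE.matcher(s);
        while (m.find()) {
            people.person.add(new Person(m.group(1), Integer.parseInt(m.group(2))));
        }
        return people;
    }

    public static void main(String[] args) {
        People people = People.match("obama (48)\nbush (63)\n");
        for (Person p : people.person) System.out.println(p.name);
    }
}
```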


33 URLs
URLs: protocol, host, path, query-string (list of key-value pairs)
Regexp:
Host = <host = [a-z]+ ("." [a-z]+ )* > ;
Path = <path = [a-z/.]* > ;
Query = <query = [a-z&=]* > ;
URL = " $Host "/" $Path "?" $Query ;
Query string further structured (list of key-value pairs):
KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
Query = $KeyVal ("&" $KeyVal)* ;

34 URLs (Usage Example)
Regexp:
Host = <host = [a-z]+ ("." [a-z]+ )* > ;
Path = <path = [a-z/.]* > ;
KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
Query = $KeyVal ("&" $KeyVal)* ;
URL = " $Host "/" $Path "?" $Query ;
Usage (example):
String s = "
URL url = URL.match(s);
print("Host is: " + url.host);
if (url.key.length>0) print("1st key: " + url.key[0]);
for (String val : url.val) println("value = " + val);
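A rough java.util.regex counterpart of this usage example is sketched below. The protocol prefix "http://" and the sample URL are assumptions, because the transcript has lost both; the class name UrlMatch is likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlMatch {
    // "http://" is an assumed protocol literal; the rest follows Host, Path, and Query.
    private static final Pattern URL = Pattern.compile(
        "http://(?<host>[a-z]+(?:\\.[a-z]+)*)/(?<path>[a-z/.]*)"
        + "\\?(?<query>[a-z]*=[a-z]*(?:&[a-z]*=[a-z]*)*)");

    final String host;
    final String path;
    final List<String> key = new ArrayList<>();
    final List<String> val = new ArrayList<>();

    private UrlMatch(String host, String path) {
        this.host = host;
        this.path = path;
    }

    // Returns the structured URL, or null if s does not match.
    static UrlMatch match(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) return null;
        UrlMatch url = new UrlMatch(m.group("host"), m.group("path"));
        // Every "&"-separated part contains "=" because the pattern enforces it.
        for (String kv : m.group("query").split("&")) {
            int eq = kv.indexOf('=');
            url.key.add(kv.substring(0, eq));
            url.val.add(kv.substring(eq + 1));
        }
        return url;
    }

    public static void main(String[] args) {
        // Hypothetical sample URL.
        UrlMatch url = match("http://www.example.com/search/index.html?q=regexp&lang=en");
        System.out.println("Host is: " + url.host);
        if (!url.key.isEmpty()) System.out.println("1st key: " + url.key.get(0));
        for (String v : url.val) System.out.println("value = " + v);
    }
}
```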

35 Log Files
Format:
13/02/ /support.html
20/02/ /search.html
...
Regexp:
Date = <date = <day = $Day > "/" <month = $Month > "/" <year = [0-9]{4} > > ;
IP = <ip = [0-9]{1,3} ("." [0-9]{1,3} ){3} > ;
Entry = <entry = $Date " " $IP " " $Path "\n" > ;
Log = $Entry * ;
Usage:
Log log = Log.match(log_file);
for (Entry e : log.entry)
  if (e.date.month == 02 && e.date.day == 29)
    print("Access on LEAP YEAR from IP# " + e.ip);
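The same scan can be sketched with java.util.regex. The log lines used here are invented for illustration (the transcript's sample lines are incomplete), and the Day and Month alternatives are borrowed from the ambiguity slide that follows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogScan {
    // Entry = $Date " " $IP " " $Path "\n"
    private static final Pattern ENTRY = Pattern.compile(
        "(?<day>0?[1-9]|[1-2][0-9]|30|31)/(?<month>0?[1-9]|10|11|12)/(?<year>[0-9]{4})"
        + " (?<ip>[0-9]{1,3}(?:\\.[0-9]{1,3}){3}) (?<path>[a-z/.]*)\n");

    // Returns the IPs of all entries dated 29 February.
    // Throws if the log is not a plain sequence of entries.
    static List<String> leapDayIps(String log) {
        List<String> ips = new ArrayList<>();
        Matcher m = ENTRY.matcher(log);
        int pos = 0;
        while (pos < log.length()) {
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException("no log entry at offset " + pos);
            }
            int day = Integer.parseInt(m.group("day"));
            int month = Integer.parseInt(m.group("month"));
            if (month == 2 && day == 29) ips.add(m.group("ip"));
            pos = m.end();
        }
        return ips;
    }

    public static void main(String[] args) {
        // Hypothetical log content.
        String log = "29/02/2008 10.0.0.1 /support.html\n"
                   + "13/02/2010 10.0.0.2 /search.html\n";
        for (String ip : leapDayIps(log)) {
            System.out.println("Access on leap day from IP# " + ip);
        }
    }
}
```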

36 Log Files (cont'd, ambiguity)
Assume we forgot "/" (between day & month):
Regexp:
Day = 0?[1-9] | [1-2][0-9] | 30 | 31 ;
Month = 0?[1-9] | 10 | 11 | 12 ;
Date = <date = <day = $Day > // no slash !
<month = $Month > "/" <year = [0-9]{4} > > ;
Error:
*** ambiguous concatenation: <day> <--> <month>
shortest ambiguous string: "101"
Ambiguity: i.e. "1/01" (January 1) vs. "10/1" (January 10) :-)
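The two readings of "101" can be listed by trying every split point between day and month. This is a small illustrative check with a hypothetical class name, not the tool's analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DayMonthSplits {
    static final Pattern DAY = Pattern.compile("0?[1-9]|[1-2][0-9]|30|31");
    static final Pattern MONTH = Pattern.compile("0?[1-9]|10|11|12");

    // Every way to read s as a Day immediately followed by a Month, shown as "day|month".
    static List<String> splits(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i <= s.length(); i++) {
            String day = s.substring(0, i);
            String month = s.substring(i);
            if (DAY.matcher(day).matches() && MONTH.matcher(month).matches()) {
                out.add(day + "|" + month);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splits("101"));  // more than one split means ambiguity
        System.out.println(splits("3112")); // a single split
    }
}
```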

37 DBLP (Format)
DBLP (XML) Format:
<article>
<author>Noam Chomsky</author>
<title>Three Models for the Description of Language</title>
<year>1956</year>
<journal>IRE Transactions on Information Theory</journal>
</article>
<article>
<author>Claus Brabrand</author>
<author>Jakob G Thomsen</author>
<title>Typed and Unambiguous Pattern Matching on Strings using Regular Expressions</title>
<year>2010</year>
<note>Submitted</note>
</article>
...

38 DBLP (Regexp)
DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <pub = $Article > * ;
Ambiguity!:
*** ambiguous star: <pub>*
shortest ambiguous string: "<article><title></title></article> <article><title></title></article>"
EITHER 2 publications (.* = "") OR 1 publication (.* = gray part)!!!

39 DBLP (Disambiguated)
DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <pub = $Article > * ;
Disambiguated (using "(R1-R2)"):
Article = "<article>" $Author* $Title (.* - (.* "</article>" .*)) "</article>" ;
Unambiguous! :-)
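java.util.regex has no restriction operator, but the effect of (.* - (.* "</article>" .*)) can be imitated with a negative lookahead before each character. The sketch below uses hypothetical names and counts how many articles each variant finds in the ambiguous string from the previous slide, written here without the space between the two articles.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleCount {
    static final String HEAD =
        "<article>(?:<author>[a-z]*</author>)*<title>[a-z]*</title>";

    // Original Article: the greedy .* may run across "</article>".
    static final Pattern AMBIGUOUS = Pattern.compile(HEAD + ".*</article>");

    // Restricted Article: (?:(?!</article>).)* matches only text without "</article>".
    static final Pattern RESTRICTED =
        Pattern.compile(HEAD + "(?:(?!</article>).)*</article>");

    // Number of non-overlapping article matches found from left to right.
    static int count(Pattern article, String s) {
        Matcher m = article.matcher(s);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String s = "<article><title></title></article>"
                 + "<article><title></title></article>";
        System.out.println(count(AMBIGUOUS, s));  // greedy .* swallows the second article
        System.out.println(count(RESTRICTED, s)); // each article is matched separately
    }
}
```

Note that "." does not match line terminators by default in java.util.regex, so multi-line input would need Pattern.DOTALL.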

40 DBLP (Usage Example)
DBLP Regexp:
Author = "<author>" <author = [a-z]* > "</author>" ;
Title = "<title>" <title = [a-z]* > "</title>" ;
Article = "<article>" $Author* $Title .* "</article>" ;
DBLP = <article = $Article > * ;
Usage (example):
DBLP dblp = DBLP.match(readXMLfile("DBLP.xml"));
for (Article a: dblp.article) print("Title: " + a.title);


42 Evaluation
Evaluation summary: [ Frisch&Cardelli'04 ] [ NP-Complete ] [ MatMult ]
Also, (Type-3) regexps are expressive "enough" for: URLs, log files, DBLP, ...

43 Type-3 vs. Type-0 (URLs)
Regexps vs. Java: Regexps are 8 times more concise!

44 java.util.regex vs. Our approach
Efficiency (on DBLP):
java.util.regex: Exponential $O(2^{|\omega|})$: ,500 chars in 2 mins!
In contrast, ours: Linear (on DBLP): 1,200,000 chars in 6 secs!
2 mins vs. 10 msecs

45 Related Work
Recording (with lists in general): "x as R" in XDuce; "x::R" in CDuce; and in Scala and HaRP
Ambiguity: [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed)
Disambiguation: [Vansummeren'06], but with global, not local disambiguation
Type inference: Exact type inference in XDuce & CDuce (soundness+completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)

46 Conclusion
For string pattern matching, it is possible to: "trade (excess) expressivity for safety+simplicity"
i.e., ambiguity checking and type inference! + stand-alone & non-intrusive language integration (Java)!
In conclusion: if regular expressions are sufficiently expressive, they provide a simple, declarative, and safe means for pattern matching on strings, capable of extracting highly structured information in a statically type-safe and unambiguous manner.

47 [ http://www.cs.au.dk/~gedefar/reg-exp-rec/ ]
</Talk> Questions? Complaints?

