Presentation Outline XDuce: Introduction Regular Expression Types Regular Expression Pattern Matching Algorithms for Pattern Matching Type Inference Conclusion / Future Works References Xperl(?)
XDuce: For What? A functional language for XML processing On the basis of Regular Expression Types, and Pattern Matching Statically Typed i.e. Outputs are statically checked against DTD-conformance etc.
Advantages (vs. “untyped”) “Untyped” XML processing: programs using DOM etc. Little connection between program and XML schema Validity can be checked only at run-time, if at all
Advantages (vs. “embedding”) “Embedding” : mapping XML schema into language’s type system. e.g. (DTD) ↓ type person = Person of name * mail list * tel option (ML)
Advantages (vs. “embedding”) Embedding does not suit intuition in some cases. e.g. Intuitively… (name, mail*, tel?) <: (name, mail*, tel*) but not name * mail list * tel option <: name * mail list * tel list
Language Features (1/2) ML-like pattern matching e.g. match p with | person[name[n], (ms as mail*), tel[t]] -> (* case: p has a tel *) | person[name[n], (ms as mail*)] -> (* case: p has no tel *) …
Language Features (2/2) Type inference e.g. if type Person = person[name[String], mail*, tel[String]?] and p :: Person then match p with person[name[n], (ms as mail*)] ⇒ n :: String, ms :: mail* are inferred.
Applications Bookmarks (Mozilla bookmark extraction) Html2Latex Diff (diff for XML) All 300 – 350 lines.
Regular Expression Types Types are defined in regular expression form with labels Concatenation, alternation, repetition as basic constructors Labels correspond to elements of XML (person, name, mail, etc.)
Syntax of Types T ::= () | X | l[T] | T, T (* concat. *) | T | T (* alt. *) | T* (* rep. *) where X : Type Variables, l : Labels
Recursive Types Types can be (mutually) recursive. e.g. type Folder = Entry* type Entry = name[String], file[File] | name[String], folder[Folder]
Subtyping Meaning of subtypes is as usual: all values t of T are also values of T’ T <: T’ ⇔ ∀t. (t ∈ T ⇒ t ∈ T’)
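The set-inclusion reading of subtyping can be illustrated on the earlier example, (name, mail*, tel?) <: (name, mail*, tel*). The sketch below is not XDuce's algorithm: it assumes flat sequences of labels only (no nested content), writes each type as a Python regular expression over space-terminated label tokens, and checks inclusion by brute force up to a bounded length. The names `sequences` and `is_subtype_up_to` are made up for this illustration.

```python
import re
from itertools import product

LABELS = ["name", "mail", "tel"]

def sequences(max_len):
    """Yield every label sequence of length 0..max_len as space-terminated tokens."""
    for n in range(max_len + 1):
        for seq in product(LABELS, repeat=n):
            yield "".join(label + " " for label in seq)

def is_subtype_up_to(sub, sup, max_len=5):
    """Bounded check only: every sequence in `sub` (up to max_len) is also in `sup`."""
    return all(
        re.fullmatch(sup, s) is not None
        for s in sequences(max_len)
        if re.fullmatch(sub, s)
    )

# (name, mail*, tel?) and (name, mail*, tel*) as regexes over "label " tokens
T1 = r"name (mail )*(tel )?"
T2 = r"name (mail )*(tel )*"

print(is_subtype_up_to(T1, T2))  # True
print(is_subtype_up_to(T2, T1))  # False: "name tel tel " is in T2 but not in T1
```

A bounded search can only refute inclusion, never prove it; a real checker works on automata instead of enumerating values.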
Subtagging Subtagging is a user-defined, “ad-hoc” subtype relation between labels e.g. small tag is a special case of tag (in HTML)
Complexity of Subtyping Subtype relation (T <: T’) is equivalent to inclusion of CFGs: Undecidable! Need some restrictions on syntax (next slide…)
Well-formedness of Types Syntactic restriction on types to ensure “regularity” Recursive use of types can only occur at the tail position of type definition, or inside labels.
Well-formed Types: Examples type X = Int, Y type Y = String, X | () and type Z = String, lab[Z], String | () are well-formed, but type U = Int, U, String | () is not.
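The well-formedness condition can be sketched as a small check over type definitions. This is one possible reading of the rule, not the definition from the papers: a variable used outside any label and not in tail position must not lead back to the definition that uses it. The tuple encoding of types and the function names are assumptions made for the example; Int and String are treated as variables without definitions.

```python
# Types: ("unit",) | ("var", X) | ("label", l, T) | ("seq", T, T) | ("alt", T, T) | ("star", T)

def top_level_uses(t, tail=True):
    """Yield (variable, is_tail) for each variable used outside any label."""
    kind = t[0]
    if kind == "var":
        yield t[1], tail
    elif kind == "seq":
        yield from top_level_uses(t[1], False)   # left of a concatenation is never tail
        yield from top_level_uses(t[2], tail)
    elif kind == "alt":
        yield from top_level_uses(t[1], tail)
        yield from top_level_uses(t[2], tail)
    elif kind == "star":
        yield from top_level_uses(t[1], False)   # body of a repetition is never tail
    # "unit" and "label" yield nothing: uses inside labels are always allowed

def reachable(defs, start):
    """Defined variables reachable from `start` through top-level uses."""
    seen, todo = set(), [start]
    while todo:
        x = todo.pop()
        if x in seen or x not in defs:
            continue
        seen.add(x)
        todo.extend(y for y, _ in top_level_uses(defs[x]))
    return seen

def well_formed(defs):
    """Reject any non-tail, top-level use that is recursive."""
    for x, body in defs.items():
        for y, is_tail in top_level_uses(body):
            if not is_tail and x in reachable(defs, y):
                return False
    return True

UNIT = ("unit",)
GOOD = {
    "X": ("seq", ("var", "Int"), ("var", "Y")),
    "Y": ("alt", ("seq", ("var", "String"), ("var", "X")), UNIT),
    "Z": ("alt", ("seq", ("var", "String"),
                  ("seq", ("label", "lab", ("var", "Z")), ("var", "String"))), UNIT),
}
BAD = {
    "U": ("alt", ("seq", ("var", "Int"), ("seq", ("var", "U"), ("var", "String"))), UNIT),
}

print(well_formed(GOOD))  # True
print(well_formed(BAD))   # False
```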
Complexity of Subtyping, again With well-formedness, checking subtype relation is: Still EXPTIME-complete, but acceptable in practical cases.
Pattern Matching Pattern match can also involve regular expression types. e.g. match p with | person[name[n], (ms as mail*), (t as tel?)] -> …
Policies of Pattern Matching Pattern matching has two basic policies: First-match (as in ML): only the first pattern matched is taken Longest-match (as usual in regexp. matching on string): matching is done as much as possible
First-match: Example (* p = person[name[…], mail, tel[…]] *) match p with | person[name[n], (ms as mail*), tel[t]] -> (* invoked *) | person[name[n], (ms as mail*), (tl as tel?)] -> (* not invoked *)
Longest-match: Example (* p = person[name, mail, mail, tel] *) match p with | … (m1 as mail*), (m2 as mail*), … -> (* m1 = mail, mail; m2 = () *)
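The same behaviour can be observed with ordinary string regular expressions. The sketch below assumes the content of person is flattened to space-terminated label tokens and uses Python's greedy quantifiers, which coincide with the longest-match policy for this particular input; it is an analogy, not XDuce's matcher.

```python
import re

# p = person[name, mail, mail, tel], flattened to space-terminated labels
content = "name mail mail tel "

# ..., (m1 as mail*), (m2 as mail*), ... with greedy repetition
pattern = r"name (?P<m1>(?:mail )*)(?P<m2>(?:mail )*)tel "

m = re.fullmatch(pattern, content)
print(repr(m.group("m1")))  # 'mail mail '
print(repr(m.group("m2")))  # ''
```

The first repetition takes both mail elements, leaving the empty sequence for the second.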
Exhaustiveness and Redundancy Pattern matches are checked against exhaustiveness and redundancy. Exhaustiveness: No “omission” of values Redundancy: Never-matched patterns
Exhaustiveness A pattern match P_1 -> e_1 | … | P_n -> e_n is exhaustive (wrt. input type T) ⇔ All values t ∈ T are matched by some P_i, i.e. T <: P_1 | … | P_n
Exhaustiveness: Example (1/2) match p with | person[name[n], (ms as mail*), tel[t]] ->... | person[name[n], (ms as mail*)] ->... is exhaustive (wrt. Person)
Exhaustiveness: Example (2/2) match p with | person[name[n], (ms as mail*), tel[t]] ->... | person[name[n], (ms as mail+)] ->... is NOT exhaustive (wrt. Person): person[name[...]] does not match
Redundancy A pattern P_i is redundant in P_1 -> e_1 | … | P_n -> e_n (wrt. input type T) ⇔ All values matched by P_i are also matched by P_1 |... | P_{i-1}
Redundancy: Example match p with | person[name[n], (ms as mail*), (tl as tel?)] ->... | person[name[n], (ms as mail*)] ->... Second pattern is redundant: anything that matches the second pattern also matches the first one.
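Both checks can be tried out on the examples above with a bounded brute-force search. This is a sketch under simplifying assumptions: the content of person is treated as a flat sequence of labels, the types and patterns are written as Python regexes over space-terminated tokens, and only sequences up to a fixed length are examined. The helper names are invented for the illustration.

```python
import re
from itertools import product

LABELS = ["name", "mail", "tel"]
PERSON = r"name (mail )*(tel )?"          # content of Person, flattened

def values_of(type_re, max_len=5):
    """All label sequences up to max_len that belong to the type."""
    out = []
    for n in range(max_len + 1):
        for seq in product(LABELS, repeat=n):
            s = "".join(label + " " for label in seq)
            if re.fullmatch(type_re, s):
                out.append(s)
    return out

def unmatched(patterns, type_re, max_len=5):
    """Values of the type that no pattern matches (empty list: exhaustive)."""
    return [v for v in values_of(type_re, max_len)
            if not any(re.fullmatch(p, v) for p in patterns)]

def redundant(patterns, type_re, max_len=5):
    """Indices of patterns whose matches are all caught by earlier patterns."""
    vals = values_of(type_re, max_len)
    out = []
    for i, p in enumerate(patterns):
        mine = [v for v in vals if re.fullmatch(p, v)]
        if all(any(re.fullmatch(q, v) for q in patterns[:i]) for v in mine):
            out.append(i)
    return out

EX1 = [r"name (mail )*tel ", r"name (mail )*"]        # exhaustive
EX2 = [r"name (mail )*tel ", r"name (mail )+"]        # misses person[name]
RED = [r"name (mail )*(tel )?", r"name (mail )*"]     # second is redundant

print(unmatched(EX1, PERSON))   # []
print(unmatched(EX2, PERSON))   # ['name ']
print(redundant(RED, PERSON))   # [1]
```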
Algorithms for Pattern Matching Pattern matching takes the following steps: Translation of values into internal forms (binary trees) Translation of types and patterns into internal forms (binary trees and tree automata) Matching of values against patterns, in terms of tree automata
Internal Forms of Values Values are represented as binary trees internally t ::= ε (* leaves *) | l(t, t) (* labels *) First node is content of the label, second is remainder of the sequence.
Internal Forms of Values: Example person[name, mail, mail] is translated into person(name(ε, mail(ε, mail(ε, ε))), ε)
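The translation can be written as a short recursive function. The representation is an assumption for this sketch: a value is a Python list of (label, children) pairs, ε is `None`, and a binary node is a tuple (label, content, remainder).

```python
EPS = None  # ε, the empty sequence (leaf)

def encode(seq):
    """Translate a sequence of elements into the binary tree l(content, remainder)."""
    if not seq:
        return EPS
    (label, children), rest = seq[0], seq[1:]
    return (label, encode(children), encode(rest))

def show(t):
    """Render a binary tree in the notation used on the slide."""
    if t is EPS:
        return "ε"
    label, content, remainder = t
    return f"{label}({show(content)}, {show(remainder)})"

# person[name, mail, mail]
value = [("person", [("name", []), ("mail", []), ("mail", [])])]
print(show(encode(value)))
# person(name(ε, mail(ε, mail(ε, ε))), ε)
```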
Internal Forms of Types Types are also translated into binary trees T::=φ(* empty *) | ε (* leaves *) | T|T | l(X, X) X is States, used in tree automata
Internal Forms of Types: Tree Automata A tree automaton M is a mapping of States -> Types e.g. M(X) = name(Y, Z) M(Y) = ε M(Z) = mail(Y, Z) | ε...
Internal Forms of Types: Example type Person = person[name, mail*, tel?] is translated into binary tree: person(X1, X0) and tree automaton M, s.t. M(X0) = ε M(X1) = name(X0, X2) M(X2) = mail(X0, X2) | tel(X0, X0) | ε
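Acceptance of a binary tree by such an automaton can be sketched in a few lines. The encoding is an assumption for the example: ε is `None`, a tree node is (label, content, remainder), and each state maps to a list of alternatives. The state X2 is written here as mail(X0, X2) | tel(X0, X0) | ε, one possible encoding of mail*, tel? in which tel may follow name directly.

```python
EPS = None  # ε leaf

# Each state maps to alternatives: "eps" or (label, content_state, remainder_state)
M = {
    "X0": ["eps"],
    "X1": [("name", "X0", "X2")],
    "X2": [("mail", "X0", "X2"), ("tel", "X0", "X0"), "eps"],
}
ROOT = [("person", "X1", "X0")]  # the top-level type person(X1, X0)

def accepts_alts(alts, t):
    """True if tree t is accepted by at least one alternative."""
    for alt in alts:
        if alt == "eps":
            if t is EPS:
                return True
        elif t is not EPS:
            label, content_state, remainder_state = alt
            if (t[0] == label
                    and accepts(content_state, t[1])
                    and accepts(remainder_state, t[2])):
                return True
    return False

def accepts(state, t):
    return accepts_alts(M[state], t)

with_mail_tel = ("person", ("name", EPS, ("mail", EPS, ("tel", EPS, EPS))), EPS)
only_tel      = ("person", ("name", EPS, ("tel", EPS, EPS)), EPS)
tel_then_mail = ("person", ("name", EPS, ("tel", EPS, ("mail", EPS, EPS))), EPS)

print(accepts_alts(ROOT, with_mail_tel))  # True
print(accepts_alts(ROOT, only_tel))       # True
print(accepts_alts(ROOT, tel_then_mail))  # False
```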
Internal Forms of Patterns Patterns are similar to types, with some additions P ::= (* same as types... *) | x : P (* x as P *) | T (* wildcard *) Wildcards are used for non “as”-ed variables
Internal Forms of Patterns: Example Pattern person[name[n], (ms as mail*)] is translated into binary tree person(Y1, Y0) and tree automaton N, s.t. N(Y0) = ε N(Y1) = name(n: T, ms:Y2) N(Y2) = mail(Y0, Y2) | ε
Pattern Matching (1/3) Pattern matching has two roles match input values (of course!) bind variables to components of input value, if matched Written formally t ∈ D ⇒ V “t is matched by D, yielding V” (V : Vars -> Values)
Pattern Matching (2/3) Matching relation t ∈ D ⇒ V is defined by following rules... (next slide) Assumptions: D is a set of patterns and states A tree automaton N is implied (D, N) corresponds to the external pattern
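What the relation t ∈ D ⇒ V computes can be sketched as a function from a pattern and a tree to a set of bindings. This is a simplified illustration, not the rule set from the paper: alternatives are tried left to right and the first success is taken, and the tuple encoding of patterns is an assumption. The automaton N is the one from the pattern example, person[name[n], (ms as mail*)].

```python
EPS = None  # ε leaf; tree nodes are (label, content, remainder)

# Patterns: ("eps",) | ("any",) | ("state", Y) | ("bind", x, P)
#         | ("alt", P, P) | ("label", l, P, P)
N = {
    "Y0": ("eps",),
    "Y1": ("label", "name", ("bind", "n", ("any",)), ("bind", "ms", ("state", "Y2"))),
    "Y2": ("alt", ("label", "mail", ("state", "Y0"), ("state", "Y2")), ("eps",)),
}
ROOT = ("label", "person", ("state", "Y1"), ("state", "Y0"))

def match(p, t):
    """Return the bindings V if tree t is matched by pattern p, else None."""
    kind = p[0]
    if kind == "any":                       # wildcard matches everything
        return {}
    if kind == "eps":
        return {} if t is EPS else None
    if kind == "state":                     # look the state up in the automaton
        return match(N[p[1]], t)
    if kind == "bind":                      # x : P binds x to the whole subtree
        v = match(p[2], t)
        return None if v is None else {**v, p[1]: t}
    if kind == "alt":                       # first alternative that succeeds wins
        v = match(p[1], t)
        return v if v is not None else match(p[2], t)
    if kind == "label":
        if t is EPS or t[0] != p[1]:
            return None
        v1 = match(p[2], t[1])
        if v1 is None:
            return None
        v2 = match(p[3], t[2])
        return None if v2 is None else {**v1, **v2}
    raise ValueError(f"unknown pattern kind: {kind}")

# person[name, mail, mail] in binary form
value = ("person", ("name", EPS, ("mail", EPS, ("mail", EPS, EPS))), EPS)
print(match(ROOT, value))
# {'n': None, 'ms': ('mail', None, ('mail', None, None))}
```

Here n is bound to the (empty) content of name and ms to the binary form of the sequence mail, mail.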
Type Inference (1/2) Infer types of variables in patterns Results are exact types of variables Type of each variable depends on pattern itself, and type of input
Type Inference (2/2) Type inference is “flow-sensitive” In P_1 -> e_1 | … | P_n -> e_n, inference on P_i depends on P_1 … P_{i-1} Because… Values matched by P_i are those NOT matched by P_1 … P_{i-1}
Type Inference: Example (1/2) (* p :: person[name, mail*, tel?] *) match p with | person[name, rest] -> … The type inferred for rest is (mail*, tel?) in this case
Type Inference: Example (2/2) match p with | person[name, tel] -> … | person[name, rest] -> … Type of rest becomes (mail+, tel?) | () in this case, because person[name, (), tel] is matched by the first pattern
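The inferred type can be cross-checked as a set difference: the values allowed by the input type, minus those already taken by the first pattern. The sketch below assumes flat label sequences, regexes over space-terminated tokens, and a bounded enumeration, so it is a sanity check of the example and not an inference algorithm.

```python
import re
from itertools import product

LABELS = ["mail", "tel"]

def seqs(max_len):
    """All label sequences of length 0..max_len as space-terminated tokens."""
    for n in range(max_len + 1):
        for seq in product(LABELS, repeat=n):
            yield "".join(label + " " for label in seq)

INPUT_REST = r"(?:mail )*(?:tel )?"        # what may follow name in the input type
FIRST      = r"tel "                       # what follows name in the first pattern
INFERRED   = r"(?:(?:mail )+(?:tel )?|)"   # (mail+, tel?) | ()

# Values reaching the second pattern: in the input type, not matched by the first
remaining = {s for s in seqs(5)
             if re.fullmatch(INPUT_REST, s) and not re.fullmatch(FIRST, s)}
inferred = {s for s in seqs(5) if re.fullmatch(INFERRED, s)}

print(remaining == inferred)  # True
```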
Type Inference: Limitations “Exact” type inference is possible only on Variables at tail position, or Inside labels (cf. well-formedness) Limitation comes from internal representation of patterns (binary trees)
Conclusion The expressiveness of regular expression types and pattern matching is useful for XML processing. Type inference (including subtype relation) is possible and efficient (in most practical cases).
Future Works Precise type inference on all variables Introducing Any type: Not possible by naïve way Breaks closure-property of tree automata Makes type inference impossible
References Regular Expression Pattern Matching for XML: Hosoya and Pierce Regular Expression Types for XML: Hosoya, Vouillon, and Pierce Available @ http://xduce.sourceforge.net/papers.html
Xperl(?) My own current research Regular expression types for Perl Motivation: Scripting languages are used more widely and will live longer than XML
Features (in mind) Regular expression (but not tree) types Infer outputs of scripts, etc. Detect possible run-time errors
Progress Report (1/3) Parsing: Nightmare! ASTs can be extracted through debug interface, fortunately :-p
Progress Report (2/3) Semantics: No specification but implementation Trying from scratch, step by step Queer, esp. around side-effects and data structures First attempt in the world?
Progress Report (3/3) Type System: Working along with semantics Types are regular expressions: τ ::= ε|α| ττ | τ|τ | τ* … Preliminary implementation of inference Still VERY trivial...
Resources No documentation yet. A working note is available as-is @ http://tabee.com/private/lab/xperl/defn.dvi