MONADIC QUERIES over TREE-STRUCTURED DATA
Georg Gottlob, TU Wien & Oxford University
Joint work with Christoph Koch, Robert Baumgartner, Marcus Herzog, and Reinhard Pichler
Talk Outline
  Semistructured data: HTML, XML
  Monadic queries: monadic datalog over trees, XPath
  Web information extraction (wrapping): Lixto
Strings, Trees, Graphs, & Logic
A few well-known results:
  Büchi: MSO = REG over strings
  Rabin: decidability of S2S
  Thatcher and Wright: MSO = REG over ranked trees (tree automata)
  Brüggemann-Klein/Wood/Murata: MSO = REG over unranked trees
  Fagin: ESO = NP
  Note: over graphs, ESO is NP-hard and MSO is hard for the polynomial hierarchy.
  Grädel/Immerman/Vardi: ESO(Horn) = Datalog = LFP = PTIME (on ordered structures)
  Courcelle: MSO in linear time on tree-like structures (treewidth <= k)
  Clarke, Emerson, Pnueli, et al.: CTL, LTL, …
Web documents are trees!
  HTML: Hypertext Markup Language
  XML: Extensible Markup Language
  HTML, XML: context-free languages. Represent a document by its parse tree.
  Tags become vertex labels, giving labeled trees.
HTML Example
[Figure: an HTML page showing the heading "DBAI" and the names Georg Gottlob and Christoph Koch, next to its parse tree with node labels html, body, h1, table, tr, td]
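XML Example
[Figure: an XML document with root paperDB and several paper elements; each paper has author and title children; the first paper has authors chandra and merlin and the title “Conjunctive Queries”; further papers are elided]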
Ordered Trees as finite structures
The child relation is a priori unordered. Two relations are used instead:
  fc = first child
  ns = next sibling
[Figure: the tree paper(author(chandra, merlin), title(“Conj. Queries”)), drawn once with child edges and once with fc and ns edges]
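The fc/ns encoding can be sketched in a few lines of Python. This is only an illustration: the nested-tuple input format and the function name `encode` are assumptions made for this sketch, not notation from the talk. Because every node has at most one first child and at most one next sibling, both relations fit in plain dictionaries.

```python
# Hypothetical input format: (label, [children...]), children in document order.
paper = ("paper", [
    ("author", [("chandra", []), ("merlin", [])]),
    ("title", [("Conj. Queries", [])]),
])

def encode(tree):
    """Return (labels, firstchild, nextsibling); nodes are numbered in document order."""
    labels, firstchild, nextsibling = {}, {}, {}

    def visit(node):
        label, children = node
        node_id = len(labels)
        labels[node_id] = label
        previous = None
        for child in children:
            child_id = visit(child)
            if previous is None:
                firstchild[node_id] = child_id      # fc edge
            else:
                nextsibling[previous] = child_id    # ns edge
            previous = child_id
        return node_id

    visit(tree)
    return labels, firstchild, nextsibling

labels, fc, ns = encode(paper)
```

Both dictionaries have at most one entry per node, which is the property the later linear-time arguments rely on.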
Core XPath
  simple location steps: paper/title
  location steps with explicit axes: paper/descendant::merlin
  qualifiers: paper[…..]
  Boolean logic: ...[chandra and merlin and (not harel)]
Full XPath:
  node set comparisons and operations
  order functions (first, last, position), etc.
  arithmetic and string operations
Implementations (in the context of XSLT processors): Xalan, XT, MS Internet Explorer (IE6)
XPath Examples
  /descendant::a/child::b
  /descendant::a/child::b[descendant::c and not(following-sibling::d)]
  /descendant::a/child::b[following-sibling::d]
[Figure: each query is shown on the same example tree with nodes labeled a, b, c, and d]
Ordered Trees as finite structures (continued)
U denotes the resulting tree structure, whose primitive relations include firstchild (fc), nextsibling (ns), and the unary label predicates label_a.
Monadic Queries over Trees
Select some nodes of a tree. Unary query f: Trees → 2^dom
Two important applications:
  Web information extraction (later)
  Monadic XML queries
Example: select titles of articles authored by Chandra and Merlin.
No joins or combinations of objects.
Yardstick: Monadic Second Order Logic (MSO)
Monadic Datalog over Trees
Example: select titles of articles authored by Chandra and Merlin.
[Figure: the paperDB tree with fc and ns edges: paperDB; paper, paper; author, title; chandra, merlin; “Conj. Queries”]
Monadic Datalog over Trees
[Figure: the paperDB tree with fc and ns edges]
  paper(X) <- root(R) & firstchild(R,X).
  paper(X) <- paper(Y) & nextsibling(Y,X).
  output(X) <- paper(P) & firstchild(P,A) & firstchild(A,Z) & label_Chandra(Z) & nextsibling(Z,V) & label_Merlin(V) & nextsibling(A,T) & firstchild(T,X).
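To see what the three rules compute, they can be followed by hand on a small tree. The sketch below is not a datalog engine: it hard-codes the rules as Python functions and relies on firstchild and nextsibling being partial functions. The concrete tree, including the second paper by Chandra and Harel titled "Other Paper", is a hypothetical example.

```python
# Hypothetical data tree: paperDB with two papers.
labels = {0: "paperDB", 1: "paper", 2: "author", 3: "Chandra", 4: "Merlin",
          5: "title", 6: "Conj. Queries", 7: "paper", 8: "author",
          9: "Chandra", 10: "Harel", 11: "title", 12: "Other Paper"}
firstchild = {0: 1, 1: 2, 2: 3, 5: 6, 7: 8, 8: 9, 11: 12}
nextsibling = {1: 7, 2: 5, 3: 4, 8: 11, 9: 10}
root = 0

def papers():
    found = set()
    x = firstchild.get(root)       # paper(X) <- root(R) & firstchild(R,X).
    while x is not None:           # paper(X) <- paper(Y) & nextsibling(Y,X).
        found.add(x)
        x = nextsibling.get(x)
    return found

def output():
    result = set()
    for p in papers():
        a = firstchild.get(p)      # firstchild(P,A)
        z = firstchild.get(a)      # firstchild(A,Z)
        v = nextsibling.get(z)     # nextsibling(Z,V)
        t = nextsibling.get(a)     # nextsibling(A,T)
        x = firstchild.get(t)      # firstchild(T,X)
        if None in (a, z, v, t, x):
            continue               # some body atom has no match
        if labels[z] == "Chandra" and labels[v] == "Merlin":
            result.add(x)
    return result
```

As written on the slide, the output rule selects the first child of the node T, that is, the text node below the title, and it does not check the labels of A or T.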
How expressive is monadic Datalog?
It was known that: Monadic Datalog 1 -MSO; Full Datalog = P
Over U, Monadic Datalog = MSO
Theorem [G. & Koch 2002]: A unary query is definable in MSO iff it is definable via a monadic datalog program.
Proof idea: simulate Unranked Query Automata (UQA) of Neven and Schwentick in monadic datalog.
UQA = unary MSO queries [Neven & Schwentick 01]
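Proof idea (continued): simulate Unranked Query Automata (UQA) of Neven and Schwentick in monadic datalog.
Example: “Even-query”
[Figure: an up transition of the automaton on a node whose children carry the labels 0 and 1]
Up transition as a rule: qodd(X) :- 0(Y), lastchild(X, Y).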
How complex is Monadic Datalog?
Theorem [G. & Koch 2002]: Monadic Datalog over U has combined complexity O(|data| * |query|).
Data complexity: P-complete and linear-time.
Previously known facts on full Datalog over graphs:
  Data complexity of Datalog: P-complete (implicit in [Vardi 88])
  Combined complexity: EXPTIME-complete (implicit in [Vardi 88])
  Combined complexity of sirups: EXPTIME-complete ([G. & Papadimitriou 99])
Proof idea:
1.) Transform datalog program + input tree in linear time into a “ground” propositional logic program.
  Exploit functional dependencies: nextsibling(X,Y) has only a linear number of ground instances: nextsibling(n_i, n_j), etc.
  Decouple independent atoms of rule bodies. The rule
    p(X) <- q(X) & r(Y) & nextsibling(X,Z) & s(Z).
  becomes
    p(X) <- q(X) & r & nextsibling(X,Z) & s(Z).
    r <- r(Y).
2.) Execute the ground program in linear time by using well-known algorithms: [Dowling & Gallier], [Minoux].
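Step 2 can be illustrated with a counter-based forward-chaining procedure in the spirit of the cited Dowling and Gallier algorithm. This is a minimal sketch, not their original formulation: the function name `least_model`, the string encoding of ground atoms, and the three-node example are all assumptions made here. Each rule keeps a count of body atoms not yet derived, and each derived atom is processed once, so the work is proportional to the size of the ground program (up to hashing).

```python
from collections import defaultdict, deque

def least_model(rules):
    """rules: list of (head, [body atoms]); a fact is a rule with an empty body."""
    remaining = [len(set(body)) for _, body in rules]   # body atoms still missing
    watchers = defaultdict(list)                        # atom -> rules using it
    for index, (_, body) in enumerate(rules):
        for atom in set(body):
            watchers[atom].append(index)
    true_atoms = set()
    queue = deque(head for head, body in rules if not body)
    while queue:
        atom = queue.popleft()
        if atom in true_atoms:
            continue
        true_atoms.add(atom)
        for index in watchers[atom]:
            remaining[index] -= 1
            if remaining[index] == 0:                   # whole body derived
                queue.append(rules[index][0])
    return true_atoms

# Hypothetical input: three sibling nodes and a few unary facts.
nodes = ["n1", "n2", "n3"]
nextsibling = {"n1": "n2", "n2": "n3"}
rules = [(f"nextsibling({x},{z})", []) for x, z in nextsibling.items()]
rules += [("q(n1)", []), ("q(n2)", []), ("s(n2)", []), ("r(n3)", [])]
# p(X) <- q(X) & r & nextsibling(X,Z) & s(Z): one ground instance per nextsibling pair.
rules += [(f"p({x})", [f"q({x})", "r", f"nextsibling({x},{z})", f"s({z})"])
          for x, z in nextsibling.items()]
# r <- r(Y): one ground instance per node.
rules += [("r", [f"r({y})"]) for y in nodes]

model = least_model(rules)
```

Because Z is determined by X through nextsibling, the decoupled rule has one ground instance per sibling pair rather than one per triple of nodes.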
XPath
W3C standard; kernel of XSLT, XQuery, etc.
[Figure: the paperDB tree with fc and ns edges]
  //paper[author[chandra and merlin]]/title
Unabbreviated syntax with explicit axes:
  /descendant::paper[child::author[child::chandra and child::merlin]]/child::title
  /descendant::chandra/following-sibling::merlin/ancestor::paper/child::title
Core XPath: A tree morphism problem
  /descendant::chandra/following-sibling::merlin/ancestor::paper/child::title
[Figure: on the left, the query tree with location steps: root, then desc. to chandra, foll-s. to merlin, anc. to paper, child to title; on the right, the paperDB data tree with fc and ns edges]
Core XPath
  simple location steps: paper/title
  location steps with explicit axes: paper/descendant::merlin
  qualifiers: paper[…..]
  Boolean logic: ...[chandra and merlin and (not harel)]
Full XPath:
  node set comparisons and operations
  order functions (first, last), etc.
  arithmetic and string operations
Implementations: Xalan, XT, MS Internet Explorer 6 (IE6)
Complexity, efficiency? [G., Koch, Pichler, VLDB 02]
Core XPath on Xalan and XT
Queries: a/b/parent::a/b/…parent::a/b
Result: exponential!
[Figure: the test document]
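One plausible way such exponential behaviour arises is an evaluator that handles each context node separately and never merges duplicate nodes. The sketch below contrasts that with step-by-step evaluation on node sets. It is not a description of the internals of Xalan or XT, and the document used (one a-node with two b-children) is an assumption, since the slide's document is not reproduced here. Node tests are omitted because in this document every child is a b and every parent is an a.

```python
# Hypothetical document: an a-node (id 0) with two b-children (ids 1 and 2).
parent = {1: 0, 2: 0}
children = {0: [1, 2], 1: [], 2: []}

def step(node, axis):
    if axis == "child":
        return children[node]
    return [parent[node]] if node in parent else []    # "parent" axis

def naive(node, steps, counter):
    """Evaluate the remaining steps once per context node, without merging duplicates."""
    counter[0] += 1
    if not steps:
        return {node}
    result = set()
    for successor in step(node, steps[0]):
        result |= naive(successor, steps[1:], counter)
    return result

def set_based(start, steps):
    """Evaluate one step at a time on whole node sets, so duplicates collapse."""
    current, work = {start}, 0
    for axis in steps:
        work += len(current)
        current = {s for n in current for s in step(n, axis)}
    return current, work

def query(k):
    # child::b followed by k repetitions of parent::a/child::b
    return ["child"] + ["parent", "child"] * k
```

Both functions return the same node set; the call counter of `naive` roughly doubles with each added `parent::a/b` pair, while the work of `set_based` grows linearly in the query length.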
Core XPath on Microsoft IE6: polynomial combined complexity, quadratic data complexity
Full XPath on IE6: Exponential combined complexity! Exponential query complexity
Axes and regular expressions
Observation: all XPath axes can be expressed as regular expressions over the U-axes firstchild and nextsibling:
  child := firstchild.nextsibling*
  parent := (nextsibling^-1)*.firstchild^-1
  descendant := firstchild.(firstchild ∪ nextsibling)*
  etc.
General definition of “axis”: a relation definable via a regular expression (with inversion) from the primitive relations of U
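The three regular expressions can be checked directly on a small tree. The sketch below assumes firstchild and nextsibling are stored as Python dictionaries and uses a helper `star` for the reflexive-transitive closure; the helper names and the example tree (the paper tree, numbered in document order) are assumptions of this illustration.

```python
# Paper tree: 0 paper, 1 author, 2 chandra, 3 merlin, 4 title, 5 text node.
fc = {0: 1, 1: 2, 4: 5}
ns = {1: 4, 2: 3}
fc_inv = {child: node for node, child in fc.items()}   # firstchild^-1
ns_inv = {right: left for left, right in ns.items()}   # nextsibling^-1

def star(relations, start):
    """Nodes reachable from `start` by zero or more steps of the given partial functions."""
    seen, stack = set(start), list(start)
    while stack:
        node = stack.pop()
        for relation in relations:
            successor = relation.get(node)
            if successor is not None and successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen

def child(x):
    # child := firstchild.nextsibling*
    return star([ns], {fc[x]}) if x in fc else set()

def parent(x):
    # parent := (nextsibling^-1)*.firstchild^-1
    return {fc_inv[n] for n in star([ns_inv], {x}) if n in fc_inv}

def descendant(x):
    # descendant := firstchild.(firstchild ∪ nextsibling)*
    return star([fc, ns], {fc[x]}) if x in fc else set()
```

For example, `child(0)` yields the author and title nodes, and `parent(3)` walks left from merlin to chandra and then up to author.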
Conjunctive queries with axes
Theorem: Evaluating conjunctive queries with axes over trees is NP-complete (query complexity).
CQ: conjunction of U-atoms and of atoms corresponding to derived axes
Example: nextsibling(X,Z) & descendant(Z,U) & ancestor(U,V) & label_a(V) & child(V,X) & (firstchild.firstchild ∪ firstchild^-1)(U,X)
However: XPath is more akin to acyclic conjunctive queries!
Acyclic conjunctive queries with axes
Theorem: Evaluating acyclic conjunctive queries with axes over trees is feasible in time O(|data| * |query|).
Proof idea: translate the acyclic query into a monadic datalog program over U.
Example query: child(A,X) & descendant(X,Y) & label_b(Y) & descendant(Y,Z) & label_a(Z)
Step 1: descendant(Y,Z) is an ear atom: it contains an ear variable (Z) that otherwise occurs in monadic atoms only. The ear is definable as a (unary) MSO query and thus expressible by a monadic datalog program:
  d(Y) <- firstchild(Y,Z) & aa(Z).
  aa(Z) <- label_a(Z).
  aa(Z) <- aa(V) & nextsibling(Z,V).
  aa(Z) <- aa(V) & firstchild(Z,V).
Step 2: replace the ear by d(Y). The query becomes child(A,X) & descendant(X,Y) & d(Y) & label_b(Y).
Step 3: the reduced query again contains an ear atom. Continue eliminating ear atoms until the query is entirely monadic.
Acyclic Monadic Datalog with Axes
AMX-Datalog: monadic datalog programs whose rule bodies are acyclic and may contain arbitrary axes
Theorem: Evaluating AMX-Datalog programs over trees is feasible in time O(|data| * |program|).
Remarks:
  Same bound for stratified AMX-Datalog
  AMX-Datalog expresses MSO over U (both without and with stratification)
Core XPath in Linear Time
Corollary: Evaluating Core XPath queries over trees is feasible in time O(|data| * |query|).
Proof: linear translation from Core XPath to stratified monadic datalog + axes
Core XPath in Linear Time
Corollary: Evaluating Core XPath queries over trees is feasible in time O(|data| * |query|).
Example: //paper[author[chandra and not merlin]]/title
  output(X) <- root(R) & descendant(R,P) & label_paper(P) & qual1(P) & child(P,X) & label_title(X).
  qual1(X) <- child(X,Y) & label_author(Y) & qual2(Y).
  qual2(X) <- child(X,Y) & label_chandra(Y) & not qual3(X).
  qual3(X) <- child(X,M) & label_merlin(M).
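The four rules can be evaluated stratum by stratum: qual3 first, because it occurs negated, then qual2, qual1, and output. The sketch below does this with Python sets on a hypothetical two-paper tree. For brevity it uses the child relation directly rather than firstchild and nextsibling, and it computes each predicate with one pass over the nodes; the helper names and node numbering are assumptions of this illustration.

```python
# Hypothetical tree: paper 1 has authors chandra and merlin, paper 7 only chandra.
labels = {0: "paperDB", 1: "paper", 2: "author", 3: "chandra", 4: "merlin",
          5: "title", 6: "Conj. Queries", 7: "paper", 8: "author",
          9: "chandra", 10: "title", 11: "Other Paper"}
children = {0: [1, 7], 1: [2, 5], 2: [3, 4], 5: [6], 7: [8, 10], 8: [9], 10: [11]}
root = 0

def kids(x):
    return children.get(x, [])

def has_child(x, label, extra=lambda y: True):
    """True if x has a child with the given label that also satisfies `extra`."""
    return any(labels[y] == label and extra(y) for y in kids(x))

def descendants(x):
    found, stack = set(), list(kids(x))
    while stack:
        node = stack.pop()
        found.add(node)
        stack.extend(kids(node))
    return found

# Lowest stratum first: qual3 is needed under negation by qual2.
qual3 = {x for x in labels if has_child(x, "merlin")}
qual2 = {x for x in labels if has_child(x, "chandra") and x not in qual3}
qual1 = {x for x in labels if has_child(x, "author", lambda y: y in qual2)}
output = {x
          for p in descendants(root) if labels[p] == "paper" and p in qual1
          for x in kids(p) if labels[x] == "title"}
```

Each pass looks at every parent-child pair once, so the total work is proportional to the tree size times the number of rules.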
Full XPath in Polynomial Time
Theorem [G., Koch, Pichler, VLDB 2002]: Evaluating full XPath queries over XML documents is feasible in polynomial time (combined complexity).
Proof: extends the logic programming evaluation paradigm to all “nasty” features of full XPath.
Implementation (main memory): XML-Taskforce XPath. To our knowledge the only XPath system that scales.
Combined Complexity of XPath PODS’03, JACM’05
Data and Query Complexity
Theorem. XPath is in L (data complexity).
Theorem. PF is L-hard under NC1-reductions (data complexity).
Theorem. XPath without multiplication and concatenation is in L w.r.t. query complexity.
[Figure: data complexity of PF and XPath: L, L-complete (NC1-red.)]
Core XPath and CTL
Straightforward translation from Core XPath with vertical axes to CTL with past modalities. (On graphs with child relation: order independent!)
Example: //paper[author[chandra and merlin]]/title
  first normalize to: //title[parent::paper[author[chandra and merlin]]]
  then translate to: title & EX^-1 (paper & EX(author & EX chandra & EX merlin))
Core XPath requires multimodal CTL, with several X modalities.
General conjunctive queries with axes
We know they are NP-complete, but…
Research programme:
  Find interesting sets of axes for which CQs are tractable.
  Trace the “tractability frontier”, i.e., determine all maximal sets of axes for which CQs are tractable.
  Extend tractability results to datalog.
PODS 2004 (G., Koch, Schulz): solved for all XPath axes
Cyclic Query Example (from Computational Linguistics)
Complexity Results (Partition of set of axes!) (combined complexity)
Some simple tractability results:
CQs with U-atoms and additional axis sets {child} or {child+, child*} can be answered in time O(|data| * |query|).
Proof idea for {child}: cycles involving child are either unsatisfiable (easy to check) or rewritable in linear time into acyclic CQs.
Proof idea for {child+, child*}
[Figure: data tree T with nodes labeled a, b, c, c, c, and cyclic query Q with variables X:a, Y:b, Z:c, U:c connected by child+ (+) and child* (*) edges. In successive frames the tree nodes are annotated with the query variables X, Y, Z, U they may still match.]
Pruning steps shown:
  U must have an ancestor labeled b!
  Z must have U as “descendant-or-self”.
Lemma: T ⊨ Q iff Reduct(Q,T) is well-labeled.
Reduct(Q,T) is locally arc-consistent! (= morphism)
Web wrapping
Goal: make web contents accessible to electronic data processing.
[Figure: WEB (HTML pages, layout), then WRAPPER, then corporate EDP apps (structured data, databases, XML)]
Wrappers: select, extract, annotate
Monadic datalog is ideally suited, but … who wanna do it?
LiXto: a graphical wrapper generator for ELOG
[Figure: example web page listing items with prices, e.g. “Notebook - Compaq Presario”]
Lixto Architecture
[Figure: example page(s) feed the Visual Wrapper Generator, which produces an extraction program in ELOG; the Extraction Module applies the program to similarly structured pages from the Web and outputs XML; further processing: tracking changes, delivering ( ,sms)... (Infopipe system)]
Elog Program for eBay pages
Expressive power of LiXto
Elog⁻: monadic kernel of Elog
Theorems [G., Koch PODS 2002]:
  Elog⁻ expresses monadic datalog.
  All of Elog⁻ is graphically programmable via LiXto.
Corollary: LiXto expresses all MSO wrapping tasks.
Comparison to other Wrapper Generators
  Lixto is more powerful than regular path queries.
  Lixto is more powerful than HEL (Sahuguet, Azavant).
The Lixto Suite
  Automated navigation to target pages
  Automated data extraction from target pages
  Automated data analysis, transformation & integration
  Automated data personalization
  Automated data delivery
Components: Visual Wrapper, Transformation Server
Product Architecture LiXto Extraction Engine Transformation Server
Marketing & Business Intelligence
[Figure: components shown: Oracle 9, BI tool (Business Objects), report, Marketing Department]
Major Customers of LiXto: