MONADIC QUERIES over TREE-STRUCTURED DATA Georg Gottlob TU Wien & Oxford University Joint work with Christoph Koch, Robert Baumgartner, Marcus Herzog, and Reinhard Pichler


2 MONADIC QUERIES over TREE-STRUCTURED DATA Georg Gottlob TU Wien & Oxford University Joint work with Christoph Koch, Robert Baumgartner, and Marcus Herzog, and Reinhard Pichler

3 Talk Outline
Semistructured data: HTML, XML
Monadic queries
Monadic datalog over trees
XPath
Web information extraction (wrapping)
Lixto

4 Strings, Trees, Graphs, & Logic
A few well-known results:
Büchi: MSO = REG over strings
Rabin: decidability of S2S
Thatcher and Wright: MSO = REG over ranked trees (tree automata)
Brüggemann-Klein/Wood/Murata: MSO = REG over unranked trees
Fagin: ESO = NP
Note: over graphs, ESO is NP-hard and MSO is hard for the Polynomial Hierarchy.
Grädel/Immerman/Vardi: ESO(Horn) = Datalog = LFP = PTIME (on ordered structures)
Courcelle: MSO in linear time on tree-like structures (treewidth <= k)
Clarke, Emerson, Pnueli, et al.: CTL, LTL, …

5 Web documents are trees!
HTML: Hypertext Markup Language
XML: Extensible Markup Language
HTML, XML: context-free languages. Represent a document by its parse tree.
Tags: vertex labels
Labeled trees.

6 HTML Example
[Figure: an HTML page with heading "People @ DBAI" and a table of two rows, Georg Gottlob (gottlob@dbai.tuwien.ac.at, 18420) and Christoph Koch (koch@dbai.tuwien.ac.at, 18449), shown next to its parse tree with nodes html, body, h1, table, tr, td]


9 XML Example
[Figure: an XML document and its tree: a paperDB element containing paper elements, each with an author (chandra, merlin) and a title (“Conjunctive Queries”); the rest is elided with …]

10 Ordered Trees as finite structures
Child-relation is a priori unordered
fc = first child, ns = next sibling
[Figure: the paper tree (author with chandra and merlin, title with “Conj. Queries”) redrawn with fc and ns edges]
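The firstchild/nextsibling encoding can be sketched in a few lines of Python. This is only an illustration: the nested-tuple input format, the function name `encode`, and the numbering of nodes in document order are assumptions made for the example, not something the slides prescribe.

```python
# Hypothetical input format: (label, [children]), children in document order.
doc = ("paper", [
    ("author", [("chandra", []), ("merlin", [])]),
    ("title", [('"Conjunctive Queries"', [])]),
])

def encode(tree):
    """Return (labels, firstchild, nextsibling); nodes are numbered in document order."""
    labels, fc, ns = {}, {}, {}

    def visit(node):
        label, children = node
        nid = len(labels)          # next free node id
        labels[nid] = label
        prev = None
        for child in children:
            cid = visit(child)
            if prev is None:
                fc[nid] = cid      # first child of nid
            else:
                ns[prev] = cid     # next sibling of the previous child
            prev = cid
        return nid

    visit(tree)
    return labels, fc, ns

labels, fc, ns = encode(doc)
print(fc)  # firstchild edges
print(ns)  # nextsibling edges
```

Both relations are partial functions, so each is stored as a dictionary with at most one entry per node.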

11 Core XPath:
- simple location steps: paper/title
- location steps with explicit axes: paper/descendant::merlin
- qualifiers: paper[…..]
- Boolean logic: ...[chandra and merlin and (not harel)]
Full XPath:
- node set comparisons and operations
- order functions (first, last, position), etc.
- arithmetic and string operations
Implementations: in the context of XSLT processors: Xalan, XT, MS Internet Explorer (IE6)

12 XPath Examples
/descendant::a/child::b
/descendant::a/child::b[descendant::c and not(following-sibling::d)]
/descendant::a/child::b[following-sibling::d]
[Figure: each query is shown on the same tree, with nodes labeled a, d, a, b, b, b, c, c, c]

13 Ordered Trees as finite structures (continued)
The tree is viewed as a finite structure over a signature τ_U whose primitive relations include firstchild and nextsibling, together with unary predicates such as root and the labels label_a.

14 Monadic Queries over Trees
Two important applications:
- Web information extraction (→ later)
- Monadic XML queries
Select some nodes of a tree: unary query f: Trees → 2^dom
Example: select titles of articles authored by Chandra and Merlin
No joins or combinations of objects
Yardstick: Monadic Second-Order Logic (MSO)

15 Monadic Datalog over Trees
Select titles of articles authored by Chandra and Merlin
[Figure: the paperDB tree with fc and ns edges: paperDB, paper, author (chandra, merlin), title (“Conj. Queries”)]

16 Monadic Datalog over Trees
[Figure: the paperDB tree with fc and ns edges]
paper(X) ← root(R) & firstchild(R,X).
paper(X) ← paper(Y) & nextsibling(Y,X).
output(X) ← paper(P) & firstchild(P,A) & firstchild(A,Z) & label_Chandra(Z) & nextsibling(Z,V) & label_Merlin(V) & nextsibling(A,T) & firstchild(T,X).
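To see the three rules at work, here is a small Python sketch that evaluates them by naive fixpoint iteration. The node numbering, the second paper (author harel, title "Other Title") and the function name `run` are assumptions added for illustration; the naive loop is not the linear-time algorithm discussed later in the talk.

```python
# Hypothetical data: node ids, labels, and firstchild / nextsibling as partial functions.
label = {0: "paperDB", 1: "paper", 2: "author", 3: "chandra", 4: "merlin",
         5: "title", 6: "Conj. Queries", 7: "paper", 8: "author",
         9: "harel", 10: "title", 11: "Other Title"}
fc = {0: 1, 1: 2, 2: 3, 5: 6, 7: 8, 8: 9, 10: 11}
ns = {1: 7, 2: 5, 3: 4, 8: 10}

def run(label, fc, ns, root):
    """Apply the three rules until no new facts are derived; return output."""
    paper, output = set(), set()
    while True:
        size = len(paper) + len(output)
        # paper(X) <- root(R) & firstchild(R,X).
        if root in fc:
            paper.add(fc[root])
        # paper(X) <- paper(Y) & nextsibling(Y,X).
        paper |= {ns[y] for y in paper if y in ns}
        # output(X) <- paper(P) & firstchild(P,A) & firstchild(A,Z) & label_Chandra(Z)
        #              & nextsibling(Z,V) & label_Merlin(V) & nextsibling(A,T) & firstchild(T,X).
        for p in paper:
            a = fc.get(p)
            z = fc.get(a)
            v = ns.get(z)
            t = ns.get(a)
            x = fc.get(t)
            if x is not None and label.get(z) == "chandra" and label.get(v) == "merlin":
                output.add(x)
        if len(paper) + len(output) == size:
            return output

result = run(label, fc, ns, root=0)
print(sorted(label[x] for x in result))
```

Because firstchild and nextsibling are functions, the body of the output rule is matched by following pointers rather than by joining relations.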

17 How expressive is monadic Datalog?
Over τ_U, Monadic Datalog = MSO
It was known that:
- Monadic Datalog ⊆ MSO (within a fragment of MSO with a single block of set quantifiers)
- Full Datalog = P
Theorem [G. & Koch 2002]: A unary query is definable in MSO iff it is definable via a monadic datalog program.

18 Proof idea: Simulate Unranked Query Automata (UQA) by Neven and Schwentick in monadic Datalog
UQA = Unary MSO Queries [Neven & Schwentick 01]


21 Example: “Even-query”
[Figure: an up transition of the automaton, with states 0 0 1 0 and 0 1 marked on the tree]
Rule simulating the up transition: qodd(X) :- 0(Y), lastchild(X, Y).

22 How complex is Monadic Datalog?
Theorem [G. & Koch 2002]: Monadic Datalog over τ_U has combined complexity O(|data| * |query|).
Data Complexity: P-complete and linear-time.
Previously known facts on full Datalog over graphs:
- Data complexity of Datalog: P-complete (impl. in [Vardi 88])
- Combined complexity: EXPTIME-complete (impl. [Vardi 88])
- Combined complexity of sirups: EXPTIME-complete ([G. & Papadimitriou 99])

23 Proof idea:
1.) Transform datalog program + input tree in linear time into a “ground” propositional logic program.
Exploit functional dependencies: nextsibling(X,Y) has only a linear number of ground instances: nextsibling(n_i, n_j), etc.
Decouple independent atoms of rule bodies:
p(X) ← q(X) & r(Y) & nextsibling(X,Z) & s(Z).
becomes
p(X) ← q(X) & r & nextsibling(X,Z) & s(Z).
r ← r(Y).
2.) Execute the ground program in linear time by using well-known algorithms: [Dowling & Gallier], [Minoux]
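The two steps can be sketched as follows. The code grounds the decoupled rule over a tiny hypothetical sibling chain and then computes the least model with a counter-based unit-propagation loop in the spirit of the cited linear-time algorithms. The function name `horn_least_model`, the tuple representation of atoms and the example facts are all assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict, deque

def horn_least_model(rules):
    """rules: list of (head, body) with body a list of atoms; facts have an empty body.
    Every rule keeps a counter of body atoms not yet derived, so each
    (rule, body atom) pair is touched once."""
    remaining = [len(set(body)) for _, body in rules]
    watchers = defaultdict(list)
    for i, (_, body) in enumerate(rules):
        for atom in set(body):
            watchers[atom].append(i)
    true = set()
    queue = deque(head for head, body in rules if not body)
    while queue:
        atom = queue.popleft()
        if atom in true:
            continue
        true.add(atom)
        for i in watchers[atom]:
            remaining[i] -= 1
            if remaining[i] == 0:
                queue.append(rules[i][0])
    return true

# Hypothetical input: three sibling nodes 0 -> 1 -> 2 and a few unary facts.
ns = {0: 1, 1: 2}
nodes = [0, 1, 2]
facts = [("q", 0), ("q", 1), ("s", 1), ("r", 2)]

rules = [(f, []) for f in facts]
# p(X) <- q(X) & r & nextsibling(X,Z) & s(Z): one ground rule per nextsibling pair.
rules += [(("p", x), [("q", x), ("r",), ("s", z)]) for x, z in ns.items()]
# r <- r(Y): one ground rule per node.
rules += [(("r",), [("r", y)]) for y in nodes]

model = horn_least_model(rules)
print(sorted(a for a in model if a[0] == "p"))
```

Grounding the rule for p produces one instance per nextsibling pair rather than one per pair of nodes, which is where the functional dependency pays off.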

24 XPath
W3C standard; kernel of XSLT, XQuery, etc.
//paper[author[chandra and merlin]]/title
Unabbreviated syntax with explicit axes:
/descendant::paper[child::author[child::chandra and child::merlin]]/child::title
/descendant::chandra/following-sibling::merlin/ancestor::paper/child::title
[Figure: the paperDB tree with fc and ns edges]

25 Core XPath: A tree morphism problem
/descendant::chandra/following-sibling::merlin/ancestor::paper/child::title
[Figure: query tree with location steps (root, chandra, merlin, paper, title, connected by desc., foll-s., anc., child edges) next to the data tree (the paperDB tree with fc and ns edges)]

26 Core XPath: A tree morphism problem (continued)
/descendant::chandra/nextsibling::merlin/ancestor::paper/child::title
[Figure: the query tree mapped into the data tree]


28 Core XPath and Full XPath (features and implementations as on slide 11)
Implementations: Xalan, XT, MS Internet Explorer 6 (IE6)
Complexity, efficiency? [G., Koch, Pichler, VLDB 02]

29 Core XPath on Xalan and XT
Queries: a/b/parent::a/b/…parent::a/b
Document:
Running time: exponential!
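One way such a blow-up can arise is an evaluator that processes each context node separately and never merges duplicate results. The sketch below illustrates the effect. The two-leaf document (an a node with two b children), the function names and the omission of node tests are assumptions made for the example; the code does not claim to reproduce what Xalan or XT actually do internally.

```python
# Hypothetical document: one a node with two b children.
children = {"a": ["b1", "b2"], "b1": [], "b2": []}
parent = {"b1": "a", "b2": "a"}

def step(node, axis):
    """Nodes reachable from node via one axis step (node tests omitted:
    every child is a b and every parent is an a in this document)."""
    if axis == "child":
        return children[node]
    return [parent[node]] if node in parent else []

def naive(context, axes):
    """Evaluate node by node, keeping duplicates in the intermediate lists."""
    for axis in axes:
        context = [m for n in context for m in step(n, axis)]
    return context

def set_based(context, axes):
    """Same steps, but intermediate results are merged into node sets."""
    context = set(context)
    for axis in axes:
        context = {m for n in context for m in step(n, axis)}
    return context

def query(k):
    """Axes of b/parent::a/b/.../parent::a/b with k parent steps, starting at a."""
    return ["child", "parent"] * k + ["child"]

for k in range(1, 6):
    print(k, len(naive(["a"], query(k))), len(set_based(["a"], query(k))))
```

The naive list doubles with every b/parent::a pair, while the set-based variant never holds more than the two b nodes.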

30 Core XPath on Microsoft IE6: polynomial combined complexity, quadratic data complexity

31 Full XPath on IE6: Exponential combined complexity! Exponential query complexity

32 Axes and regular expressions
Observation: All XPath axes can be expressed as regular expressions over the τ_U-axes firstchild and nextsibling:
child := firstchild.nextsibling*
parent := (nextsibling^-1)*.firstchild^-1
descendant := firstchild.(firstchild ∪ nextsibling)*
etc. …
General definition of “axis”: a relation definable via a regular expression (with inversion) from the primitive relations of τ_U
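The three regular expressions translate directly into pointer-chasing code. The sketch below assumes firstchild and nextsibling are given as Python dictionaries over a small hypothetical tree; the function names and node numbering are illustrative only.

```python
# Hypothetical tree: 0 paperDB, 1 paper, 2 author, 3 chandra, 4 merlin, 5 title, 6 text.
fc = {0: 1, 1: 2, 2: 3, 5: 6}
ns = {2: 5, 3: 4}

def child(fc, ns, x):
    """child := firstchild.nextsibling*"""
    out = []
    c = fc.get(x)
    while c is not None:
        out.append(c)
        c = ns.get(c)
    return out

def parent(fc, ns, x):
    """parent := (nextsibling^-1)*.firstchild^-1"""
    prev = {v: k for k, v in ns.items()}   # inverse of nextsibling
    up = {v: k for k, v in fc.items()}     # inverse of firstchild
    while x in prev:
        x = prev[x]                        # walk left to the first sibling
    return up.get(x)                       # None for the root

def descendant(fc, ns, x):
    """descendant := firstchild.(firstchild ∪ nextsibling)*"""
    found = set()
    stack = [fc[x]] if x in fc else []
    while stack:
        n = stack.pop()
        if n not in found:
            found.add(n)
            stack += [m for m in (fc.get(n), ns.get(n)) if m is not None]
    return found

print(child(fc, ns, 1), parent(fc, ns, 4), sorted(descendant(fc, ns, 1)))
```

Note that descendant starts from the first child, so the siblings of x itself are never visited.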


34 Conjunctive queries with axes
Theorem: Evaluating conjunctive queries with axes over trees is NP-complete (query complexity).
CQ: conjunction of τ_U-atoms and of atoms corresponding to derived axes
Example: nextsibling(X,Z) & descendant(Z,U) & ancestor(U,V) & label_a(V) & child(V,X) & (firstchild.firstchild ∪ firstchild^-1)(U,X)
However: XPath is more akin to acyclic conjunctive queries!

35 Acyclic conjunctive queries with axes
Theorem: Evaluating acyclic conjunctive queries with axes over trees is feasible in time O(|data| * |query|).
Proof idea: translate the acyclic query into a monadic datalog program over τ_U.
Example query: descendant(X,Y), child(A,X), descendant(Y,Z), label_a(Z), label_b(Y)

36 Ear atom: an atom which contains an ear variable that otherwise occurs in monadic atoms only (here descendant(Y,Z), with ear variable Z). It is definable as a (unary) MSO query and thus expressible by a monadic datalog program.

37 Program for the ear atom:
d(Y) ← firstchild(Y,Z) & aa(Z).
aa(Z) ← label_a(Z).
aa(Z) ← aa(V) & nextsibling(Z,V).
aa(Z) ← aa(V) & firstchild(Z,V).

38 Query after replacing the ear atom: descendant(X,Y), child(A,X), d(Y), label_b(Y)

39 The query again contains an ear atom. Continue eliminating ear atoms until the query is entirely monadic.
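The rules for d and aa can be checked on a small example. The sketch below computes aa by naive fixpoint iteration and then reads off d(Y), the nodes that have a descendant labeled a. The tree, the node numbering and the function name are hypothetical, and the naive loop is for illustration only; it is not the linear-time evaluation the theorem refers to.

```python
# Hypothetical tree: 0 (r) has children 1 (b) and 2 (c); 1 has child 3 (c); 3 has child 4 (a).
label = {0: "r", 1: "b", 2: "c", 3: "c", 4: "a"}
fc = {0: 1, 1: 3, 3: 4}
ns = {1: 2}

def nodes_with_descendant(label, fc, ns, target="a"):
    # aa(Z) <- label_a(Z).
    aa = {n for n, l in label.items() if l == target}
    changed = True
    while changed:
        changed = False
        # aa(Z) <- aa(V) & nextsibling(Z,V).   aa(Z) <- aa(V) & firstchild(Z,V).
        for rel in (ns, fc):
            for z, v in rel.items():
                if v in aa and z not in aa:
                    aa.add(z)
                    changed = True
    # d(Y) <- firstchild(Y,Z) & aa(Z).
    return {y for y, z in fc.items() if z in aa}

d = nodes_with_descendant(label, fc, ns)
print(sorted(d))
```

Intersecting d with the nodes labeled b gives the candidates for Y in the reduced query of slide 38.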

40 Acyclic Monadic Datalog with Axes
AMX-Datalog: monadic datalog programs whose rule bodies are acyclic and may contain arbitrary axes
Theorem: Evaluating AMX-Datalog programs over trees is feasible in time O(|data| * |program|).
Remarks:
- Same bound for stratified AMX-Datalog
- AMX-Datalog expresses MSO over τ_U (both without and with stratification)

41 Core XPath in Linear Time
Corollary: Evaluating Core XPath queries over trees is feasible in time O(|data| * |query|).
Proof: Linear translation from Core XPath to stratified Monadic Datalog + axes

42 Core XPath in Linear Time (example)
//paper[author[chandra and not merlin]]/title
output(X) ← root(R) & descendant(R,P) & label_paper(P) & qual1(P) & child(P,X) & label_title(X).
qual1(X) ← child(X,Y) & label_author(Y) & qual2(Y).
qual2(X) ← child(X,Y) & label_chandra(Y) & not qual3(X).
qual3(X) ← child(X,M) & label_merlin(M).
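The stratified program can be evaluated one predicate at a time, starting with qual3 so that its negation is already settled when qual2 is computed. The sketch below does this with Python sets on a hypothetical two-paper document; the data, node numbering and helper names are assumptions made for illustration.

```python
# Hypothetical document: paper 1 has authors chandra and merlin, paper 6 only chandra.
label = {0: "paperDB", 1: "paper", 2: "author", 3: "chandra", 4: "merlin",
         5: "title", 6: "paper", 7: "author", 8: "chandra", 9: "title"}
children = {0: [1, 6], 1: [2, 5], 2: [3, 4], 6: [7, 9], 7: [8]}

def kids(x):
    return children.get(x, [])

def descendants(x):
    for c in kids(x):
        yield c
        yield from descendants(c)

# Lowest stratum first: qual3(X) <- child(X,M) & label_merlin(M).
qual3 = {x for x in label if any(label[m] == "merlin" for m in kids(x))}
# qual2(X) <- child(X,Y) & label_chandra(Y) & not qual3(X).
qual2 = {x for x in label
         if x not in qual3 and any(label[y] == "chandra" for y in kids(x))}
# qual1(X) <- child(X,Y) & label_author(Y) & qual2(Y).
qual1 = {x for x in label
         if any(label[y] == "author" and y in qual2 for y in kids(x))}
# output(X) <- root(R) & descendant(R,P) & label_paper(P) & qual1(P)
#              & child(P,X) & label_title(X).
output = {x for p in descendants(0) if label[p] == "paper" and p in qual1
          for x in kids(p) if label[x] == "title"}

print(sorted(output))
```

Only the title of the second paper is selected, because the author node of the first paper also has a merlin child.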

43 Full XPath in Polynomial Time
Theorem [G., Koch, Pichler, VLDB 2002]: Evaluating full XPath queries over XML documents is feasible in polynomial time (combined complexity).
Proof: Extends the Logic Programming evaluation paradigm to all “nasty” features of full XPath.
Implementation (main memory): XML-Taskforce XPath. To our knowledge the only XPath system that scales.

44 Combined Complexity of XPath PODS’03, JACM’05

45 Data and Query Complexity
Theorem. XPath is in L (data complexity).
Theorem. PF is L-hard under NC1-reductions (data complexity).
Theorem. XPath w/o multiplication, concatenation is in L w.r.t. query complexity.
Summary (data complexity): XPath is in L; PF is L-complete (NC1-reductions).

46 Core XPath and CTL
Straightforward translation from Core XPath with vertical axes to CTL with past modalities. (On graphs with child relation: order independent!)
//paper[author[chandra and merlin]]/title
first normalize to:
//title[parent::paper[author[chandra and merlin]]]
then translate to:
title & EX^-1(paper & EX(author & EXchandra & EXmerlin))
Core XPath requires multimodal CTL: several distinct X modalities, etc.

47 General conjunctive queries with axes
We know they are NP-complete, but…
Research programme:
- Find interesting sets of axes for which CQs are tractable.
- Trace the “tractability frontier”, i.e., determine all maximal sets of axes for which CQs are tractable.
- Extend tractability results to datalog.
PODS 2004: G., Koch, Schulz: Solved for all XPath axes

48 Cyclic Query Example (from Computational Linguistics)

49 Complexity Results (Partition of set of axes!) (combined complexity)

50 Some simple tractability results:
CQs with τ_U-atoms and additional axis sets {child} or {child+, child*} can be answered in time O(|data| * |query|).
Proof idea for {child}: cycles involving child are either unsatisfiable (easy to check) or rewritable in linear time into acyclic CQs.

51 Proof idea for {child+, child*}
[Figure: cyclic query Q over variables X:a, Y:b, Z:c, U:c, with edges labeled + (child+) and * (child*), next to a data tree T with nodes labeled a, b, c, c, c]

52 [Figure: the data tree T with candidate nodes for X, Y, Z, U marked]

53 [Figure: candidate nodes, continued]

54 U must have an ancestor labeled b!

55 [Figure: candidate sets for Z and U after this step]

56 Z must have U as “descendant-or-self”

57 [Figure: remaining candidates for X, Y, Z, U]

58 Lemma: T ⊨ Q iff Reduct(Q,T) is well-labeled.
Reduct(Q,T): locally arc-consistent!

59 [Figure: the morphism from Q into T obtained from Reduct(Q,T)]


61 Web wrapping
Goal: Make web contents accessible to electronic data processing
[Figure: WEB (HTML pages, layout), WRAPPER, corporate EDP apps (structured data, databases, XML)]
Wrappers: select, extract, annotate
Monadic datalog ideally suited, but … who wants to do it?
LiXto: a graphical wrapper generator for ELOG

62 409449118 98 Degrees - Notebook - New 2.99 $ - 413171469 Notebook - Compaq Presario 1207 730.00 AU $ [...]

63 Lixto Architecture
[Figure: example page(s), Visual Wrapper Generator, ELOG web extraction program, Extraction Module applied to similarly structured pages, XML output; further processing: tracking changes, delivering (email, SMS), ... (InfoPipe system)]

64 Elog Program for eBay pages

65 Expressive power of LiXto
Theorems [G., Koch PODS 2002]:
- Elog⁻ expresses monadic datalog
- All of Elog⁻ is graphically programmable via LiXto
Elog⁻: monadic kernel of Elog
Corollary: LiXto expresses all MSO wrapping tasks.

66 Comparison to other Wrapper Generators
Lixto more powerful than regular path queries
Lixto more powerful than HEL (Sahuguet, Azavant) (→ paper)

67 The Lixto Suite
Automated navigation to target pages
Automated data extraction from target pages
Automated data analysis, transformation & integration
Automated data personalization
Automated data delivery
Visual Wrapper, Transformation Server

68 Product Architecture LiXto Extraction Engine Transformation Server

69 Oracle 9 Marketing Department BI Tool Business Objects report Marketing & Business Intelligence

70 Major Customers of LiXto:


