

MONADIC QUERIES over TREE-STRUCTURED DATA
Georg Gottlob, TU Wien & Oxford University
Joint work with Christoph Koch, Robert Baumgartner, Marcus Herzog, and Reinhard Pichler

Talk Outline
• Semistructured data: HTML, XML
• Monadic queries: monadic datalog over trees, XPath
• Web information extraction (wrapping): Lixto

Strings, Trees, Graphs, & Logic
A few well-known results:
• Büchi: MSO = REG over strings
• Rabin: decidability of S2S
• Thatcher and Wright: MSO = REG over ranked trees (tree automata)
• Brüggemann-Klein/Wood/Murata: MSO = REG over unranked trees
• Fagin: ESO = NP. Note: over graphs, ESO is NP-hard and MSO is hard for the polynomial hierarchy.
• Grädel/Immerman/Vardi: ESO(Horn) = Datalog = LFP = PTIME (on ordered structures)
• Courcelle: MSO in linear time on tree-like structures (treewidth <= k)
• Clarke, Emerson, Pnueli, et al.: CTL, LTL, …

Web documents are trees!
• HTML: Hypertext Markup Language
• XML: Extensible Markup Language
HTML, XML: context-free languages. Represent a document by its parse tree.
Tags: vertex labels. Labeled trees.

HTML Example
[Figure: an HTML page with the heading "DBAI" and a table listing Georg Gottlob and Christoph Koch, shown next to its parse tree with nodes html, body, h1, table, tr, td.]


XML Example
[Figure: XML tree with root paperDB and several paper children. The first paper has an author subtree (chandra, merlin) and a title (“Conjunctive Queries”); the remaining papers are elided.]

Ordered Trees as finite structures
The child relation is a priori unordered. Sibling order is captured by two relations: fc = first child, ns = next sibling.
[Figure: the paper tree (paper; author with chandra and merlin; title with “Conj. Queries”) redrawn with fc and ns edges.]
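The fc/ns encoding above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `encode` function and the numbering of nodes in document order are assumptions made for the example, not part of the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ordered, unranked tree node (illustration only)."""
    label: str
    children: list = field(default_factory=list)


def encode(root):
    """Number nodes in document order and return (labels, firstchild, nextsibling)."""
    labels, firstchild, nextsibling = {}, {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        nid = counter
        counter += 1
        labels[nid] = node.label
        prev = None
        for child in node.children:
            cid = visit(child)
            if prev is None:
                firstchild[nid] = cid      # fc edge: parent -> leftmost child
            else:
                nextsibling[prev] = cid    # ns edge: child -> its right neighbour
            prev = cid
        return nid

    visit(root)
    return labels, firstchild, nextsibling


paper = Node("paper", [
    Node("author", [Node("chandra"), Node("merlin")]),
    Node("title", [Node("Conjunctive Queries")]),
])
labels, firstchild, nextsibling = encode(paper)
print(firstchild)   # fc edges as a partial function
print(nextsibling)  # ns edges as a partial function
```

Both relations are partial functions, which is why plain dictionaries suffice here.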

Core XPath
• simple location steps: paper/title
• location steps with explicit axes: paper/descendant::merlin
• qualifiers: paper[…..]
• Boolean logic: ...[chandra and merlin and (not harel)]
Full XPath:
• node set comparisons and operations
• order functions (first, last, position), etc.
• arithmetic and string operations
Implementations: in the context of XSLT processors: Xalan, XT, MS Internet Explorer (IE6)

XPath Examples
• /descendant::a/child::b
• /descendant::a/child::b[descendant::c and not(following-sibling::d)]
• /descendant::a/child::b[following-sibling::d]
[Figure: each query is illustrated on an example tree with node labels a, b, c, d.]

Ordered Trees as finite structures (continued)
The tree is represented as a finite structure over the signature τ_U, whose relations include firstchild (fc), nextsibling (ns) and the unary label predicates label_a.

Monadic Queries over Trees
Two important applications:
• Web information extraction (→ later)
• Monadic XML queries
Select some nodes of a tree: a unary query f: Trees → 2^dom.
Example: select titles of articles authored by Chandra and Merlin.
No joins or combinations of objects.
Yardstick: Monadic Second-Order Logic (MSO)

Monadic Datalog over Trees
Task: select titles of articles authored by Chandra and Merlin.
[Figure: the paperDB tree with fc and ns edges.]

Monadic Datalog over Trees (continued)
[Figure: the paperDB tree with fc and ns edges.]
paper(X) ← root(R) & firstchild(R,X).
paper(X) ← paper(Y) & nextsibling(Y,X).
output(X) ← paper(P) & firstchild(P,A) & firstchild(A,Z) & label_Chandra(Z) & nextsibling(Z,V) & label_Merlin(V) & nextsibling(A,T) & firstchild(T,X).
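To make the three rules concrete, here is a minimal Python sketch that evaluates them directly over fc/ns dictionaries. The node numbering, the second paper (by "harel") and the function name `evaluate` are assumptions added for illustration. It hard-codes this one program and is not a general datalog engine.

```python
# Hypothetical paperDB tree, nodes numbered in document order.
labels = {
    0: "paperDB",
    1: "paper", 2: "author", 3: "chandra", 4: "merlin",
    5: "title", 6: "Conjunctive Queries",
    7: "paper", 8: "author", 9: "harel",
    10: "title", 11: "Another Title",
}
firstchild = {0: 1, 1: 2, 2: 3, 5: 6, 7: 8, 8: 9, 10: 11}
nextsibling = {1: 7, 2: 5, 3: 4, 8: 10}
root = 0


def evaluate():
    paper = set()
    # paper(X) <- root(R) & firstchild(R,X).
    if root in firstchild:
        paper.add(firstchild[root])
    # paper(X) <- paper(Y) & nextsibling(Y,X).   (iterate to a fixpoint)
    changed = True
    while changed:
        changed = False
        for y in list(paper):
            x = nextsibling.get(y)
            if x is not None and x not in paper:
                paper.add(x)
                changed = True

    output = set()
    for p in paper:
        # Follow the chain of fc/ns atoms in the body of the output rule.
        a = firstchild.get(p)
        z = firstchild.get(a) if a is not None else None
        v = nextsibling.get(z) if z is not None else None
        t = nextsibling.get(a) if a is not None else None
        x = firstchild.get(t) if t is not None else None
        if a is None or z is None or v is None or t is None or x is None:
            continue
        # As on the slide, only the chandra and merlin labels are tested.
        if labels[z] == "chandra" and labels[v] == "merlin":
            output.add(x)
    return paper, output


paper_nodes, output_nodes = evaluate()
print(sorted(paper_nodes), [labels[x] for x in sorted(output_nodes)])
```

Because firstchild and nextsibling are functional, every body atom has at most one match per node, which is the property the linear-time evaluation result relies on.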

How expressive is monadic Datalog?
It was known that:
• Monadic Datalog is contained in a first-level fragment of MSO
• Full Datalog = P
Theorem [G. & Koch 2002]: Over τ_U, Monadic Datalog = MSO. A unary query is definable in MSO iff it is definable via a monadic datalog program.

Proof idea: simulate Unranked Query Automata (UQA) by Neven and Schwentick in monadic Datalog.
UQA = unary MSO queries [Neven & Schwentick 01]

Proof idea (continued): simulate Unranked Query Automata (UQA) by Neven and Schwentick in monadic Datalog.
Example: “Even-query”
[Figure: an up transition of the automaton over child states 0 0 1 0.]
An up transition becomes a datalog rule such as:
qodd(X) :- 0(Y), lastchild(X, Y).

How complex is Monadic Datalog?
Theorem [G. & Koch 2002]: Monadic Datalog over τ_U has combined complexity O(|data| * |query|). Data complexity: P-complete and linear-time.
Previously known facts on full Datalog over graphs:
• Data complexity of Datalog: P-complete (implicit in [Vardi 88])
• Combined complexity: EXPTIME-complete (implicit in [Vardi 88])
• Combined complexity of sirups: EXPTIME-complete ([G. & Papadimitriou 99])

Proof idea:
1.) Transform datalog program + input tree in linear time into a “ground” propositional logic program.
• Exploit functional dependencies: nextsibling(X,Y) has only a linear number of ground instances: nextsibling(n_i, n_j), etc.
• Decouple independent atoms of rule bodies. The rule
  p(X) ← q(X) & r(Y) & nextsibling(X,Z) & s(Z).
  becomes
  p(X) ← q(X) & r & nextsibling(X,Z) & s(Z).
  r ← r(Y).
2.) Execute the ground program in linear time by using well-known algorithms: [Dowling & Gallier], [Minoux]
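Step 2 can be illustrated with a counter-based forward-chaining procedure for propositional Horn programs, in the spirit of the Dowling & Gallier and Minoux algorithms. The sketch below is a simplified version written for this transcript; the function name `horn_least_model` and the tiny ground program (two sibling nodes n1, n2) are assumptions.

```python
from collections import defaultdict, deque


def horn_least_model(rules):
    """rules: list of (head, body) with body a list of atoms; facts have an empty body.

    Each rule keeps a counter of body atoms not yet derived. Every atom is
    processed once, so the work is linear in the total size of the program.
    """
    remaining = []                 # per rule: number of underived body atoms
    watchers = defaultdict(list)   # atom -> indices of rules with it in the body
    queue = deque()
    derived = set()

    for idx, (head, body) in enumerate(rules):
        body_atoms = set(body)     # a repeated body atom counts once
        remaining.append(len(body_atoms))
        for atom in body_atoms:
            watchers[atom].append(idx)
        if not body_atoms:
            queue.append(head)

    while queue:
        atom = queue.popleft()
        if atom in derived:
            continue
        derived.add(atom)
        for idx in watchers[atom]:
            remaining[idx] -= 1
            if remaining[idx] == 0:
                queue.append(rules[idx][0])
    return derived


# Ground instances of  p(X) <- q(X) & r & nextsibling(X,Z) & s(Z)  and  r <- r(Y)
# over a hypothetical tree where nextsibling(n1, n2) is the only sibling pair.
ground_rules = [
    ("p(n1)", ["q(n1)", "r", "s(n2)"]),
    ("r", ["r(n1)"]),
    ("r", ["r(n2)"]),
    ("q(n1)", []),
    ("s(n2)", []),
    ("r(n2)", []),
]
model = horn_least_model(ground_rules)
print(sorted(model))
```

The nextsibling atom disappears during grounding because it holds for exactly one pair, which keeps the ground program linear in the size of the tree.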

XPath
W3C standard; kernel of XSLT, XQuery, etc.
[Figure: the paperDB tree with fc and ns edges.]
//paper[author[chandra and merlin]]/title
Unabbreviated syntax with explicit axes:
/descendant::paper[child::author[child::chandra and child::merlin]]/child::title
/descendant::chandra/following-sibling::merlin/ancestor::paper/child::title

Core XPath: A tree morphism problem
/descendant::chandra/following-sibling::merlin/ancestor::paper/child::title
[Figure: the query tree with location steps (root, desc. to chandra, foll-s. to merlin, anc. to paper, child to title) is mapped into the data tree, the paperDB tree with fc and ns edges.]


Core XPath
• simple location steps: paper/title
• location steps with explicit axes: paper/descendant::merlin
• qualifiers: paper[…..]
• Boolean logic: ...[chandra and merlin and (not harel)]
Full XPath:
• node set comparisons and operations
• order functions (first, last), etc.
• arithmetic and string operations
Implementations: Xalan, XT, MS Internet Explorer 6 (IE6)
Complexity, efficiency? [G., Koch, Pichler, VLDB 02]

Core XPath on Xalan and XT: exponential!
Queries: a/b/parent::a/b/…parent::a/b
[Figure: the test document and the timing plot are not reproduced in this transcript.]
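The blow-up on queries of this shape can be reproduced with a toy model. The sketch below assumes a document consisting of one a node with two b children, and it compares a context-node-at-a-time evaluator that never merges duplicate contexts with a set-at-a-time evaluator. Both evaluators, the document and the work counters are illustrative assumptions; this is not the code of Xalan or XT.

```python
# Hypothetical document: node 0 labeled a, with children 1 and 2 labeled b.
LABEL = {0: "a", 1: "b", 2: "b"}
CHILDREN = {0: [1, 2], 1: [], 2: []}
PARENT = {1: 0, 2: 0}


def step(node, axis, test):
    """Nodes reached from `node` by one location step axis::test."""
    if axis == "child":
        candidates = CHILDREN[node]
    else:  # "parent"
        candidates = [PARENT[node]] if node in PARENT else []
    return [n for n in candidates if LABEL[n] == test]


def naive(node, steps):
    """Evaluate per context node without merging duplicates; returns (result, calls)."""
    if not steps:
        return {node}, 1
    (axis, test), rest = steps[0], steps[1:]
    result, calls = set(), 1
    for n in step(node, axis, test):
        sub_result, sub_calls = naive(n, rest)
        result |= sub_result
        calls += sub_calls
    return result, calls


def setwise(start, steps):
    """Evaluate one step at a time on node sets; returns (result, node visits)."""
    current, work = {start}, 0
    for axis, test in steps:
        following = set()
        for node in current:
            work += 1
            following.update(step(node, axis, test))
        current = following
    return current, work


def query(k):
    """Steps of b/parent::a/b/.../parent::a/b with k occurrences of b."""
    steps = [("child", "b")]
    for _ in range(k - 1):
        steps += [("parent", "a"), ("child", "b")]
    return steps


for k in (2, 4, 8, 16):
    _, naive_calls = naive(0, query(k))
    _, set_work = setwise(0, query(k))
    print(k, naive_calls, set_work)
```

In this model the naive call count roughly doubles with every added parent::a/b pair, while the set-based work grows by a constant per pair.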

Core XPath on Microsoft IE6: polynomial combined complexity, quadratic data complexity.
[Plot: quadratic growth.]

Full XPath on IE6: exponential combined complexity!
[Plot: exponential query complexity.]

Axes and regular expressions
Observation: all XPath axes can be expressed as regular expressions over the τ_U-axes firstchild and nextsibling:
child := firstchild.nextsibling*
parent := (nextsibling^-1)*.firstchild^-1
descendant := firstchild.(firstchild ∪ nextsibling)*
etc.
General definition of “axis”: a relation definable via a regular expression (with inversion) from the primitive relations of τ_U.
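The three definitions can be checked mechanically by treating firstchild and nextsibling as sets of pairs and implementing composition, inversion and Kleene star. The sketch below materializes whole relations, so it is quadratic and meant only as an illustration; the helper names and the six-node example tree are assumptions.

```python
def compose(r, s):
    """Relational composition r.s"""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}


def inverse(r):
    """Inverse relation r^-1"""
    return {(y, x) for (x, y) in r}


def star(rel, nodes):
    """Reflexive-transitive closure rel* over the given node set."""
    closure = {(n, n) for n in nodes}
    frontier = set(closure)
    while frontier:
        new_pairs = compose(frontier, rel) - closure
        closure |= new_pairs
        frontier = new_pairs
    return closure


# Hypothetical tree: 0 paper, 1 author, 2 chandra, 3 merlin, 4 title, 5 text.
nodes = range(6)
fc = {(0, 1), (1, 2), (4, 5)}   # firstchild
ns = {(1, 4), (2, 3)}           # nextsibling

child = compose(fc, star(ns, nodes))                    # firstchild.nextsibling*
parent = compose(star(inverse(ns), nodes), inverse(fc)) # (nextsibling^-1)*.firstchild^-1
descendant = compose(fc, star(fc | ns, nodes))          # firstchild.(firstchild ∪ nextsibling)*

print(sorted(child))
print(sorted(descendant))
```

The parent relation computed this way coincides with the inverse of child, as one would expect.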


Conjunctive queries with axes
Theorem: Evaluating conjunctive queries with axes over trees is NP-complete (query complexity).
CQ: conjunction of τ_U-atoms and of atoms corresponding to derived axes.
Example: nextsibling(X,Z) & descendant(Z,U) & ancestor(U,V) & label_a(V) & child(V,X) & (firstchild.firstchild ∪ firstchild^-1)(U,X)
However: XPath is more akin to acyclic conjunctive queries!

Acyclic conjunctive queries with axes
Theorem: Evaluating acyclic conjunctive queries with axes over trees is feasible in time O(|data| * |query|).
Proof idea: translate the acyclic query into a monadic datalog program over τ_U.
Example query: child(A,X) & descendant(X,Y) & label_b(Y) & descendant(Y,Z) & label_a(Z)

Acyclic conjunctive queries with axes (continued)
descendant(Y,Z) is an ear atom: it contains an ear variable, Z, that otherwise occurs in monadic atoms only. The ear descendant(Y,Z) & label_a(Z) is definable as a (unary) MSO query on Y and is thus expressible by a monadic datalog program.

Acyclic conjunctive queries with axes (continued)
The ear descendant(Y,Z) & label_a(Z) is replaced by a new monadic predicate d(Y):
d(Y) ← firstchild(Y,Z) & aa(Z).
aa(Z) ← label_a(Z).
aa(Z) ← aa(V) & nextsibling(Z,V).
aa(Z) ← aa(V) & firstchild(Z,V).

Acyclic conjunctive queries with axes (continued)
Query after the replacement: child(A,X) & descendant(X,Y) & label_b(Y) & d(Y), where d is defined by the four rules on the previous slide.

Acyclic conjunctive queries with axes (continued)
Now descendant(X,Y) is an ear atom, since Y otherwise occurs only in the monadic atoms label_b(Y) and d(Y). Continue eliminating ear atoms until the query is entirely monadic.
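The rules for d and aa from the ear-elimination step can be evaluated in a single backward pass. The sketch assumes nodes are numbered in document order, so that firstchild and nextsibling always point to a larger number; the tree, the function name `eval_d` and that numbering convention are illustrative assumptions.

```python
def eval_d(labels, firstchild, nextsibling):
    """Compute the predicates aa and d from the slide's four rules.

    aa(Z) holds if Z is labeled a, or aa holds for Z's first child or next sibling.
    d(Y)  holds if aa holds for Y's first child, i.e. Y has a descendant labeled a.
    """
    aa = set()
    # Visiting nodes in reverse document order means the first child and the
    # next sibling of z have already been decided when z is visited.
    for z in sorted(labels, reverse=True):
        if (labels[z] == "a"
                or firstchild.get(z) in aa
                or nextsibling.get(z) in aa):
            aa.add(z)
    d = {y for y, z in firstchild.items() if z in aa}
    return aa, d


# Hypothetical tree: r(b(c, a), b(c)), numbered 0..5 in document order.
labels = {0: "r", 1: "b", 2: "c", 3: "a", 4: "b", 5: "c"}
firstchild = {0: 1, 1: 2, 4: 5}
nextsibling = {1: 4, 2: 3}

aa_nodes, d_nodes = eval_d(labels, firstchild, nextsibling)
print(sorted(aa_nodes), sorted(d_nodes))
```

Each node is inspected once, which matches the O(|data| * |query|) bound stated in the theorem for this single ear.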

Acyclic Monadic Datalog with Axes
AMX-Datalog: monadic datalog programs whose rule bodies are acyclic and may contain arbitrary axes.
Theorem: Evaluating AMX-datalog programs over trees is feasible in time O(|data| * |program|).
Remarks:
• Same bound for stratified AMX-Datalog.
• AMX-Datalog expresses MSO over τ_U (both without and with stratification).

Core XPath in Linear Time
Corollary: Evaluating Core XPath queries over trees is feasible in time O(|data| * |query|).
Proof: linear translation from Core XPath to stratified Monadic Datalog + axes.

Core XPath in Linear Time (continued)
Example: //paper[author[chandra and not merlin]]/title
output(X) ← root(R) & descendant(R,P) & label_paper(P) & qual1(P) & child(P,X) & label_title(X).
qual1(X) ← child(X,Y) & label_author(Y) & qual2(Y).
qual2(X) ← child(X,Y) & label_chandra(Y) & not qual3(X).
qual3(X) ← child(X,M) & label_merlin(M).
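The stratified program for this example can be evaluated innermost qualifier first, one node set per predicate. The following sketch does this over a small hypothetical paperDB document; the document, the helper `has_child_labeled` and the use of explicit child lists instead of fc/ns are assumptions made to keep the example short.

```python
# Hypothetical document: two papers, the first by chandra and merlin,
# the second by chandra alone.
LABEL = {
    0: "paperDB",
    1: "paper", 2: "author", 3: "chandra", 4: "merlin", 5: "title",
    6: "paper", 7: "author", 8: "chandra", 9: "title",
}
CHILDREN = {0: [1, 6], 1: [2, 5], 2: [3, 4], 3: [], 4: [], 5: [],
            6: [7, 9], 7: [8], 8: [], 9: []}
ROOT = 0


def has_child_labeled(label, extra=lambda y: True):
    """Nodes X with some child Y such that label(Y) and extra(Y) hold."""
    return {x for x, kids in CHILDREN.items()
            if any(LABEL[y] == label and extra(y) for y in kids)}


def descendants(node):
    """All proper descendants of node."""
    found, stack = set(), list(CHILDREN[node])
    while stack:
        n = stack.pop()
        found.add(n)
        stack.extend(CHILDREN[n])
    return found


# Lowest stratum first, because qual2 uses qual3 under negation.
qual3 = has_child_labeled("merlin")                         # qual3(X)
qual2 = has_child_labeled("chandra") - qual3                # qual2(X)
qual1 = has_child_labeled("author", lambda y: y in qual2)   # qual1(X)
output = {x
          for p in descendants(ROOT)
          if LABEL[p] == "paper" and p in qual1
          for x in CHILDREN[p] if LABEL[x] == "title"}      # output(X)

print(sorted(output))
```

Every predicate is computed once as a set of nodes, so the work per rule is proportional to the size of the document.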

Full XPath in Polynomial Time
Theorem [G., Koch, Pichler, VLDB 2002]: Evaluating full XPath queries over XML documents is feasible in polynomial time (combined complexity).
Proof: extends the logic programming evaluation paradigm to all “nasty” features of full XPath.
Implementation (main memory): XML-Taskforce XPath. To our knowledge the only XPath system that scales.

Combined Complexity of XPath PODS’03, JACM’05

Data and Query Complexity
Theorem. XPath is in L (data complexity).
Theorem. PF is L-hard under NC1-reductions (data complexity).
Theorem. XPath w/o multiplication, concatenation is in L w.r.t. query complexity.
[Figure: data complexity diagram placing PF and XPath in L, L-complete under NC1-reductions.]

Core XPath and CTL
Straightforward translation from Core XPath with vertical axes to CTL with past modalities. (On graphs with child relation: order independent!)
Example: //paper[author[chandra and merlin]]/title
First normalize to: //title[parent::paper[author[chandra and merlin]]]
Then translate to: title & EX^-1(paper & EX(author & EX chandra & EX merlin))
Core XPath requires multimodal CTL, with several variants of the X modality.

General conjunctive queries with axes
We know they are NP-complete, but…
Research programme:
• Find interesting sets of axes for which CQs are tractable.
• Trace the “tractability frontier”, i.e., determine all maximal sets of axes for which CQs are tractable.
• Extend tractability results to datalog.
PODS 2004 (G., Koch, Schulz): solved for all XPath axes.

Cyclic Query Example (from Computational Linguistics)

Complexity Results (Partition of set of axes!) (combined complexity)

Some simple tractability results:
CQs with τ_U-atoms and additional axis sets {child} or {child+, child*} can be answered in time O(|data| * |query|).
Proof idea for {child}: cycles involving child are either unsatisfiable (easy to check) or rewritable in linear time into acyclic CQs.

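Proof idea for {child+, child*}
[Figure: cyclic query Q over the variables X:a, Y:b, Z:c, U:c, connected by child+ (+) and child* (*) edges, next to a data tree T with nodes labeled a, b, c, c, c.]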

Proof idea for {child+, child*} (continued)
[Figure sequence: the nodes of T are annotated with the candidate variables X, Y, Z, U, and candidates are pruned step by step.]
• U must have an ancestor labeled b!
• Z must have U as “descendant-or-self”.

Proof idea for {child+, child*} (continued)
Lemma: T ⊨ Q iff Reduct(Q,T) is well-labeled.
[Figure: Reduct(Q,T), the data tree annotated with the remaining candidates for X, Y, Z, U. It is locally arc-consistent, which yields a morphism from Q into T.]

Web wrapping
Goal: make web contents accessible to electronic data processing.
[Figure: the WEB (HTML pages, layout) on one side; corporate EDP applications (structured data, databases, XML) on the other.]

Web wrapping (continued)
Goal: make web contents accessible to electronic data processing.
[Figure: a WRAPPER sits between the WEB (HTML pages, layout) and the corporate EDP applications (structured data, databases, XML).]
Wrappers: select, extract, annotate.
Monadic datalog is ideally suited, but … who wanna do it?
LiXto: a graphical wrapper generator for ELOG.

[Example listing, markup lost: "Degrees - Notebook - New", 2.99 $; "Notebook - Compaq Presario", AU $; [...]]

Lixto Architecture
[Figure: example page(s) feed the Visual Wrapper Generator, which produces an ELOG extraction program; the Extraction Module applies it to similarly structured web pages and outputs XML.]
Further processing: tracking changes, delivering (…, SMS), ... (InfoPipe system)

Elog Program for eBay pages

Expressive power of LiXto
Elog^-: monadic kernel of Elog.
Theorems [G., Koch PODS 2002]:
• Elog^- expresses monadic datalog.
• All of Elog^- is graphically programmable via LiXto.
Corollary: LiXto expresses all MSO wrapping tasks.

Comparison to other Wrapper Generators
• Lixto is more powerful than regular path queries.
• Lixto is more powerful than HEL (Sahuguet, Azavant).
→ see paper

The Lixto Suite
• Automated navigation to target pages
• Automated data extraction from target pages
• Automated data analysis, transformation & integration
• Automated data personalization
• Automated data delivery
Components: Visual Wrapper, Transformation Server

Product Architecture LiXto Extraction Engine Transformation Server

Marketing & Business Intelligence
[Figure: Oracle 9, BI tool (Business Objects), report, marketing department.]

Major Customers of LiXto:
