Representing and Querying XML with Incomplete Information
Serge Abiteboul (INRIA), Luc Segoufin (INRIA), Victor Vianu (UCSD)


PODS 2001, Abiteboul-Segoufin-Vianu

Organization
  Motivations
  Simplifying assumptions
  Model of incompleteness
  Answering queries
  Results
  Discussion
  Conclusion

Motivations

The Web is a world of incompleteness
Information you get from the Web is seldom complete:
  Queries return some, not all, of the data
  Limited storage capability
  Documents change on the Web: expiration
  Sites are unavailable…
Context: a warehouse of XML documents from the Web, Xyleme

This work
A simple, practically appealing approach to managing incomplete information
  Sequence of queries to the Web: (q1,A1)+(q2,A2)+…
  Answers are cached
  Process a new query without access to the Web
  Give an incomplete answer
  Explain the incompleteness to the user
  Seek additional information, i.e., find a minimal set of queries needed to answer fully

Related work
Semantic caching
Answering queries using views
  Keep (Q_i, A_i)
  Try to rewrite query Q into Q(A_1,...,A_n)
  Reject if you cannot
Incomplete databases
  (Q_i, A_i) is some incomplete knowledge of the DB
  Related to querying incomplete information, e.g. Lipski-Imielinski

Challenge: balance expressiveness and tractability
  Choice of data model
  Choice of the query language
  Choice of a representation of incompleteness
Results
  Simple, practical solution
  Extra features lead to serious problems

Simplifying Assumptions

Data is XML: trees
[Figure: example tree. A dealer node has children UsedCars and NewCars; the ad under UsedCars has model and year (Honda, 96), the ad under NewCars has model (Acura).]

Simplified XML
Unordered trees with
  a labelling function: catalog, product, name, price, cat (category), subcategory, picture
  a value function: =can, =444, =electronique, =nik, =234, =electronic, =camera, =c.jpg
[Figure: a catalog tree with product nodes, annotated with the labels and values above.]

Simple XML types
[Figure: type tree with root catalog, child product, and product children name, price, cat, picture, subcategory; two edges are marked *.]
Multiplicities:
  1 : 1 child (default)
  * : 0 or more
  + : 1 or more
  ? : 0 or 1
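The simplified data model and the multiplicity-based types can be sketched in a few lines of Python. This is only an illustration: the names `Node`, `BOUNDS`, `CATALOG_TYPE` and `conforms` are hypothetical, and the choice of which edges carry `*` (product and picture) is an assumption, since the figure does not survive in this transcript.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A node of an unordered tree: a label, an optional data value, children."""
    label: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


# Multiplicity symbol -> (minimum, maximum) number of children; None = unbounded.
BOUNDS = {"1": (1, 1), "*": (0, None), "+": (1, None), "?": (0, 1)}

# Assumed type: which edges are starred is a guess, not taken from the slides.
CATALOG_TYPE: Dict[str, Dict[str, str]] = {
    "catalog": {"product": "*"},
    "product": {"name": "1", "price": "1", "cat": "1",
                "subcategory": "1", "picture": "*"},
}


def conforms(node: Node, tree_type: Dict[str, Dict[str, str]]) -> bool:
    """Check that every node has the right number of children of each label."""
    allowed = tree_type.get(node.label, {})  # labels without a rule are leaves
    counts = Counter(child.label for child in node.children)
    if any(label not in allowed for label in counts):
        return False
    for label, mult in allowed.items():
        low, high = BOUNDS[mult]
        n = counts[label]
        if n < low or (high is not None and n > high):
            return False
    return all(conforms(child, tree_type) for child in node.children)


canon = Node("product", children=[
    Node("name", "canon"), Node("price", "120"), Node("cat", "elec"),
    Node("subcategory", "camera"), Node("picture", "c.jpg"),
])
no_name = Node("product", children=[
    Node("price", "199"), Node("cat", "elec"), Node("subcategory", "camera"),
])
ok_catalog = Node("catalog", children=[canon])
bad_catalog = Node("catalog", children=[canon, no_name])
```

Because the trees are unordered, the check only counts children per label; no regular-expression matching over child sequences is needed.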

Prefix Selection Queries (ps-queries)
[Figure: two tree patterns. Query1: catalog / product / name, price <200, cat =elec, subcategory. Query2: catalog / product / name, picture.]
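One plausible reading of ps-query evaluation is sketched below: a data node matches a pattern node when the labels agree, the value condition holds, and every pattern child is matched by at least one data child; the answer is the prefix of the data tree made of matched nodes. The class and function names (`Node`, `Pattern`, `evaluate`) are hypothetical, and the akai price of 250 is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


@dataclass
class Pattern:
    label: str
    cond: Optional[Callable[[Optional[str]], bool]] = None  # None = no condition
    children: List["Pattern"] = field(default_factory=list)


def evaluate(pattern: Pattern, node: Node) -> Optional[Node]:
    """Return the matched prefix of the subtree at `node`, or None if no match."""
    if pattern.label != node.label:
        return None
    if pattern.cond is not None and not pattern.cond(node.value):
        return None
    kept: List[Node] = []
    for child_pattern in pattern.children:
        matches = [m for m in (evaluate(child_pattern, c) for c in node.children)
                   if m is not None]
        if not matches:  # a required pattern child has no match below this node
            return None
        kept.extend(matches)
    return Node(node.label, node.value, kept)


def product(name: str, price: str, cat: str, sub: str) -> Node:
    return Node("product", None, [Node("name", name), Node("price", price),
                                  Node("cat", cat), Node("subcategory", sub)])


catalog = Node("catalog", None, [
    product("canon", "120", "elec", "camera"),
    product("nikon", "199", "elec", "camera"),
    product("sony", "175", "elec", "cdplayer"),
    product("akai", "250", "elec", "camera"),  # price 250 is made up
])

# Query1: name, subcategory, price of electronic products priced under 200.
query1 = Pattern("catalog", children=[
    Pattern("product", children=[
        Pattern("name"),
        Pattern("price", lambda v: float(v) < 200),
        Pattern("cat", lambda v: v == "elec"),
        Pattern("subcategory"),
    ]),
])

answer = evaluate(query1, catalog)
names = [c.value for p in answer.children for c in p.children if c.label == "name"]
```

Since a ps-query never repeats a child label, each data node can match at most one pattern child, which keeps this recursion simple.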

Simplifications
Data
  No order
  No distinction between attributes and elements
  No recursion
  No links
Query
  No complex path expressions
  No join
  No repeated child
[Figure: a pattern product / name, cat=elec, cat=toy, marked NO because the child cat is repeated.]

Crucial assumption: XID
Sources of identifiers:
  URLs
  ID/IDrefs
[Figure: two partial views of the same prod node &245, one carrying camera and one carrying c.jpg, are combined (+) into a single prod node &245 with canon, 120, elec, camera, c.jpg.]
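The role of node identifiers can be illustrated with a small merge routine: two cached answers that mention the same identifier are known to describe the same node, so their children can be unioned. This is a sketch under assumptions, not the authors' algorithm; `IdNode`, `merge` and all child identifiers (&246 to &250) are hypothetical, and only &245 appears in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class IdNode:
    node_id: str
    label: str
    value: Optional[str] = None
    children: Dict[str, "IdNode"] = field(default_factory=dict)  # keyed by node id


def merge(a: IdNode, b: IdNode) -> IdNode:
    """Union two partial views of the same node, matching children by id."""
    if a.node_id != b.node_id or a.label != b.label:
        raise ValueError("can only merge two views of the same node")
    value = a.value if a.value is not None else b.value
    merged = IdNode(a.node_id, a.label, value)
    for cid in sorted(a.children.keys() | b.children.keys()):
        if cid in a.children and cid in b.children:
            merged.children[cid] = merge(a.children[cid], b.children[cid])
        elif cid in a.children:
            merged.children[cid] = a.children[cid]  # subtree shared, not copied
        else:
            merged.children[cid] = b.children[cid]
    return merged


def leaf(node_id: str, label: str, value: str) -> IdNode:
    return IdNode(node_id, label, value)


# View obtained from one answer: name, price, category, subcategory.
view1 = IdNode("&245", "prod", children={
    "&246": leaf("&246", "name", "canon"),
    "&247": leaf("&247", "price", "120"),
    "&248": leaf("&248", "cat", "elec"),
    "&249": leaf("&249", "subcategory", "camera"),
})
# View obtained from another answer: subcategory and picture.
view2 = IdNode("&245", "prod", children={
    "&249": leaf("&249", "subcategory", "camera"),
    "&250": leaf("&250", "picture", "c.jpg"),
})

merged = merge(view1, view2)
merged_labels = [child.label for child in merged.children.values()]
```

Without identifiers, the merge would have to guess whether the two prod nodes are the same product, which is the source of the case analysis mentioned later in the discussion.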

Representation of incomplete information: Incomplete trees

Document Type Definitions (DTDs) are used to represent incompleteness
  Set of rules e → r, where e is an element name and r is a regular expression
  Set of trees satisfying a DTD d: tree(d)
Shortcoming of DTDs
  An element has a single definition, independently of the context
  The type of ad depends on the context
[Figure: dealer with children newcar and usedcar, each with an ad child; one ad has model and year, the other has only model.]

Solution: specialization (decoupled tags)
  ad_used and ad_new
  h(ad_used) = h(ad_new) = ad
[Figure: the dealer tree with specialized tags ad_new (under newcar) and ad_used (under usedcar) is mapped by h to the dealer tree in which both nodes are tagged ad.]
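Specialization can be sketched as a rule table over specialized tags plus a projection h back to the visible tags. The code below is a simplified illustration: content models are multiplicity maps rather than regular expressions, data values are omitted, the multiplicities themselves are assumptions, and it assumes that within one rule no two specialized children project to the same visible tag, so the specialized tag of each child is determined by its parent.

```python
from collections import Counter

# Multiplicity symbol -> (minimum, maximum); None = unbounded.
BOUNDS = {"1": (1, 1), "*": (0, None), "+": (1, None), "?": (0, 1)}

# Rules over specialized tags (multiplicities are assumed for the example).
RULES = {
    "dealer": {"usedcar": "1", "newcar": "1"},
    "usedcar": {"ad_used": "*"},
    "newcar": {"ad_new": "*"},
    "ad_used": {"model": "1", "year": "1"},
    "ad_new": {"model": "1"},
    "model": {},
    "year": {},
}

# Projection h: specialized tag -> visible tag (identity when not listed).
H = {"ad_used": "ad", "ad_new": "ad"}


def h(tag):
    return H.get(tag, tag)


def valid(tree, spec_tag):
    """tree is (label, [children]); check it against the rule for spec_tag."""
    label, children = tree
    if h(spec_tag) != label:
        return False
    rule = RULES[spec_tag]
    by_label = {h(s): s for s in rule}  # visible child label -> specialized tag
    counts = Counter()
    for child in children:
        child_spec = by_label.get(child[0])
        if child_spec is None or not valid(child, child_spec):
            return False
        counts[child_spec] += 1
    for s, mult in rule.items():
        low, high = BOUNDS[mult]
        if counts[s] < low or (high is not None and counts[s] > high):
            return False
    return True


good = ("dealer", [
    ("usedcar", [("ad", [("model", []), ("year", [])])]),
    ("newcar", [("ad", [("model", [])])]),
])
bad = ("dealer", [
    ("usedcar", [("ad", [("model", []), ("year", [])])]),
    ("newcar", [("ad", [("model", []), ("year", [])])]),  # new-car ad with a year
])
```

The same visible tag ad is checked against two different rules depending on its parent, which a plain DTD cannot express.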

DTDs + Specialization
The sets of trees that can be specified: the regular unranked tree languages [Brüggemann-Klein+Murata+Wood]
  Same closure properties: intersection, union, complement
  Same complexity

Example
Q1: name, subcat, price of electronic products with price less than $200
Q2: name, pictures of cameras pictured at least once
Q3: name, price, pictures of cameras costing less than $100 and pictured at least once
  Can be completely answered using A1, A2
Q4: list all cameras
  Can be partially answered using A1, A2

Q1: name, subcat, price of electronic products with price less than 200
[Figure: known data after Q1: a catalog with products (canon, 120, elec, camera), (nikon, 199, elec, camera) and (sony, 175, elec, cdplayer). Missing data: product1* and product2*.]

Missing data after Q1
[Figure: two types for missing products, with children among name, price, cat, subcategory, picture*. product1 has cat !=elec; product2 has cat =elec and price >200.]

Q2: name, pictures of cameras pictured at least once
[Figure: known data after Q2: a catalog with products canon (120, elec, camera, c.jpg), nikon (199, elec, camera), sony (175, elec, cdplayer) and akai (elec, camera, a.jpg). Missing data: product1*, product2a*, product2b*, product2c*.]

Incomplete information
Known information
  Prefix of the real data tree
Missing information
  Extended tree type
  Conditions on data values
  Specializations, disjunctions

[Figure: known data and missing data side by side. Types for missing products, each with children among name, price, cat, subcategory, picture: product1 (cat !=elec, picture*); product2a, product2b and product2c (cat =elec, price >200, with variants on subcategory !=camera and on having no picture); product3 (cat elec); at least one product overall (product +).]

Answering Queries

Complete answer to Q3
Q3: name, price, pictures of cameras costing less than $150 and having at least one picture
  Can be fully answered using available information
  Need to check whether the answer is complete
[Figure: answer tree catalog / prod with canon, 120, c.jpg.]

Incomplete answer to Q4
  Provide known cameras
  Explain incompleteness
[Figure: answer with nodes canon, nikon, sony, akai, plus possibly more products having name, price >200 and no picture.]

Completing answer to Q4
It suffices to ask:
[Figure: query pattern product / name, price >200, cat =elec, sub =camera, picture 0.]

Revisit the types
  DTD
  Conditions
  Specialization: the same element name may have several types
Not sufficient
  Need to extend the types again: disjunctions
[Figure: type product2b with children name, price >200, cat =elec, subcategory !=camera, picture*.]

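Disjunction
[Figure: a vehicle type with children data, engine, description, sail, two of them marked "?". Query1 and Query2 are patterns over vehicle (one with data, description; one with data, engine, sail), evaluated on vehicle node &322 with data=…, description=…. One answer is marked Empty!]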

Disjunction continued
Type of &322: vehicle1 + vehicle2
  vehicle1: data, engine, description
  vehicle2: data, description, sail
The type of &322 cannot be described independently of that of the data below

Results

Representation system (Lipski and Imielinski)
  rep(T): set of possible worlds
  q(rep(T)) = rep(q(T))
[Figure: commuting diagram. T is the representation of the information and q(T) the representation of the result; rep maps each representation to its set of possible worlds, and q answers on the possible worlds.]
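Spelled out, the representation-system condition says that querying the representation and then taking possible worlds gives the same set as querying each possible world. The notation below is a sketch: the symbol t for a possible world is introduced here and does not appear in the slides.

```latex
\[
  \mathrm{rep}\bigl(q(T)\bigr) \;=\; q\bigl(\mathrm{rep}(T)\bigr)
  \;=\; \{\, q(t) \mid t \in \mathrm{rep}(T) \,\}
\]
```

Here $q(T)$ on the left is computed directly on the representation $T$, while on the right $q$ is applied to each world $t$ separately.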

Representation System for PS-queries
Incomplete tree T to represent q_1^{-1}(A_1) ∩ … ∩ q_k^{-1}(A_k)
For a PS-query q, q(T) can be computed in PTIME (the representation of the answer can be computed in PTIME)

Querying Incomplete Trees
Given T and a query q, one can
  Give in PTIME the sure answers according to current knowledge
  Check in PTIME whether query q can be fully answered
  Generate in PTIME queries to complete the answer

Comparison with IL
IL (Imielinski-Lipski)
  Relational model
  Relational calculus/algebra
  Conditional tables
  Closed or open world
  Representation system
This work
  XML tree model
  Weaker language (no join)
  Weaker system (no variables)
  + Closed and open world
  Representation system

Drawback: exponential blowup
Incomplete information may become exponential w.r.t. the sequence of queries/answers q_1/A_1; q_2/A_2; …
[Figure: example. Type: database with children a and b. Queries q_i: database with a=i and b=i. All answers are empty.]

Dealing with exponential blowup
Make the representation more complex using disjunctions of types
  Size of representation stays polynomial
  Manipulations much more complex
Restrict tree types and PS-queries
  Already very (too?) simple
Accept losing some information
Ask extra queries to simplify the representation

Discussion

Discussion: extending the language
Some results in the paper
Extensions often lead to intractability
E.g., k-pebble transducers [Milo, Suciu, Vianu], which somehow subsume XML-QL and XSL
  No (known) representation system
  Testing whether rep(T) is empty is non-elementary

Discussion: node IDs
Without node IDs
  Much less information to integrate
  Results more complex
  Tedious case analysis

Discussion: ordering
Ordering in XML, DTDs, queries
The problem is totally different and very complex
Example: Q1/A1: list of males; Q2/A2: list of females; Q3: list all
Depending on the type of the input:
  (Male)*(Female)*: A3 = A1 || A2
  (Male Female)*: A3 = shuffle(A1,A2)
  (Male + Female)*: A3 cannot be determined
Regular expression processing
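The three cases of the ordering example can be made concrete with plain lists. This is a sketch under assumptions: it reads shuffle(A1,A2) as the strict alternation forced by the type (Male Female)*, and the element names m1, m2, f1, f2 are invented.

```python
def concatenate(males, females):
    """Type (Male)*(Female)*: all males come first, so A3 = A1 || A2."""
    return list(males) + list(females)


def alternate(males, females):
    """Type (Male Female)*: strict alternation, so A3 is fully determined."""
    if len(males) != len(females):
        raise ValueError("(Male Female)* needs as many males as females")
    return [x for pair in zip(males, females) for x in pair]


def interleavings(a, b):
    """Type (Male + Female)*: every order-preserving merge is a possible A3."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


A1 = ["m1", "m2"]   # answer to Q1: list of males
A2 = ["f1", "f2"]   # answer to Q2: list of females

a3_concat = concatenate(A1, A2)
a3_alternating = alternate(A1, A2)
a3_candidates = list(interleavings(A1, A2))
```

Under the third type the two cached answers are consistent with several different orderings, so Q3 has no single answer that can be computed from A1 and A2 alone.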

Conclusion
Framework for acquiring, maintaining and querying incomplete XML data
Limitations:
  Simple queries
  No order, and the ID assumption
  Small extensions lead to problems
Possible to represent the incompleteness
Possible to answer with incompleteness
Possible to obtain queries that provide the full answer