Querying and storing XML

Slides:



Advertisements
Similar presentations
Configuration management
Advertisements

Configuration management
Provenance analysis of algorithms 10/1/13 V. Tannen University of Pennsylvania 1WebDam someTowards ?
D ATABASE S YSTEMS I R ELATIONAL A LGEBRA. 22 R ELATIONAL Q UERY L ANGUAGES Query languages (QL): Allow manipulation and retrieval of data from a database.
The Volcano/Cascades Query Optimization Framework
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A Modified by Donghui Zhang.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
An Overview of Data Provenance
Chapter 1: Data Models and DBMS Architecture Title: What Goes Around Comes Around Authors: M. Stonebraker, J. Hellerstein Pages: 2-40.
CS 4432query processing1 CS4432: Database Systems II.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
Query Optimization. General Overview Relational model - SQL  Formal & commercial query languages Functional Dependencies Normalization Physical Design.
1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles of Provenance (PrOPr) Philadelphia, PA June 26, 2007.
Managing XML and Semistructured Data Lecture 14: Constraints and Keys Prof. Dan Suciu Spring 2001.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Rutgers University Relational Algebra 198:541 Rutgers University.
Relational Algebra Chapter 4 - part I. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
Propositional Calculus Math Foundations of Computer Science.
CSCD343- Introduction to databases- A. Vaisman1 Relational Algebra.
Graph Algebra with Pattern Matching and Aggregation Support 1.
Indexing XML Data Stored in a Relational Database VLDB`2004 Shankar Pal, Istvan Cseri, Gideon Schaller, Oliver Seeliger, Leo Giakoumakis, Vasili Vasili.
Relational Algebra, R. Ramakrishnan and J. Gehrke (with additions by Ch. Eick) 1 Relational Algebra.
Query Processing Presented by Aung S. Win.
Class 6 Data and Business MIS 2000 Updated: September 2012.
1 Relational Algebra and Calculus Chapter 4. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.
Comparing XSLT and XQuery Michael Kay XTech 2005.
Chapter 27 The World Wide Web and XML. Copyright © 2004 Pearson Addison-Wesley. All rights reserved.27-2 Topics in this Chapter The Web and the Internet.
DBSQL 3-1 Copyright © Genetic Computer School 2009 Chapter 3 Relational Database Model.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 2: Intro to Relational.
Querying Structured Text in an XML Database By Xuemei Luo.
I Information Systems Technology Ross Malaga 4 "Part I Understanding Information Systems Technology" Copyright © 2005 Prentice Hall, Inc. 4-1 DATABASE.
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
RRXS Redundancy reducing XML storage in relations O. MERT ERKUŞ A. ONUR DOĞUÇ
SCUHolliday - COEN 17814–1 Schedule Today: u Query Processing overview.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra.
Introduction to Database Systems1. 2 Basic Definitions Mini-world Some part of the real world about which data is stored in a database. Data Known facts.
INFO1408 Database Design Concepts Week 15: Introduction to Database Management Systems.
CS 4432query processing1 CS4432: Database Systems II Lecture #11 Professor Elke A. Rundensteiner.
1 Relational Algebra and Calculas Chapter 4, Part A.
Relational Algebra.
ICS 321 Fall 2011 The Relational Model of Data (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 8/29/20111Lipyeow.
Reconcilable Differences Todd J. GreenZachary G. IvesVal Tannen University of Pennsylvania March 24, ICDT 09, Saint Petersburg.
Chapter 27 The World Wide Web and XML. Copyright © 2004 Pearson Addison-Wesley. All rights reserved.27-2 Topics in this Chapter The Web and the Internet.
1 Relational Algebra Chapter 4, Sections 4.1 – 4.2.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Database Management Systems Chapter 4 Relational Algebra.
CSCD34-Data Management Systems - A. Vaisman1 Relational Algebra.
Database Management Systems, R. Ramakrishnan1 Relational Algebra Module 3, Lecture 1.
1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania PODS 2007.
Relational Algebra p BIT DBMS II.
Fall CSE330/CIS550: Introduction to Database Management Systems Prof. Susan Davidson Office: 278 Moore Office hours: TTh
Lecture 15: Query Optimization. Very Big Picture Usually, there are many possible query execution plans. The optimizer is trying to chose a good one.
Welcome to CPSC 534B: Information Integration Laks V.S. Lakshmanan Rm. 315.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Chapter 14: Query Optimization
KEEPS – a system for UELMA preservation and security
Module 2: Intro to Relational Model
Relational Algebra Chapter 4 1.
Chapter 2: Intro to Relational Model
Relational Algebra Chapter 4, Part A
Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.
Relational Algebra 1.
Relational Algebra Chapter 4 1.
Relational Algebra Chapter 4, Sections 4.1 – 4.2
Chapter 2: Intro to Relational Model
Chapter 2: Intro to Relational Model
Example of a Relation attributes (or columns) tuples (or rows)
Chapter 2: Intro to Relational Model
Query Optimization.
Presentation transcript:

Querying and storing XML Week 8 Provenance March 12-15, 2013

What is provenance? Evidence of Origin History Authenticity Integrity Value

Why is provenance important for data? For traditional (paper) information: Creation process leaves “paper trail” Easier to detect modification, copying, forgery Can usually judge a book by its cover For electronic information: Often no such thing as a “bit trail” Easy to forge, plagiarize, alter data undetected Can't judge a database by its cover - there isn't one Provenance essential for judging quality of data

Provenance failures can be expensive

Especially important for scientific data

Provenance in Databases Provenance models extensively studied in relational databases Why-provenance Where-provenance How-provenance ....? Will examine provenance models for relational queries first following recent survey [Cheney, Chiticariu, Tan 2009]

Why-provenance (Buneman, Khanna, Tan 2001) Why-provenance: shows input data witnessing existence of output data R S R JOIN S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

Why-provenance (Buneman, Khanna, Tan 2001) Why-provenance: shows input data witnessing existence of output data = subset of input that is "enough" to generate output R S R JOIN S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

Why-provenance (Buneman, Khanna, Tan 2001) Why-provenance: shows input data witnessing existence of output data = subset of input that is "enough" to generate output R S R JOIN S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

Where-provenance (Buneman, Khanna, Tan 2001) Where-provenance: tracks where data in output comes from R S R JOIN S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

Where-provenance (Buneman, Khanna, Tan 2001) Where-provenance: tracks where data in output comes from R S R JOIN S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

Where-provenance (Buneman, Khanna, Tan 2001) Can think of provenance as "links" R S R JOIN S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

S C D 1 2 3 R' A B C 1 2 3 4 R A B C 1 2 5 6 3 4

Where-provenance (Buneman, Khanna, Tan 2001) Can think of provenance as "links" or propagated "annotations" R S R JOIN S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

Where-provenance (Buneman, Khanna, Tan 2001) Not invariant under query equivalence SELECT r.A,r.B,r.C,s.D FROM R r, S s WHEERE r.C = s.C R S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

Where-provenance (Buneman, Khanna, Tan 2001) Not invariant under query equivalence SELECT r.A,r.B,s.C,s.D FROM R r, S s WHEERE r.C = s.C R S A B C 1 2 3 4 C D 1 2 3 A B C D 1 2 3

Early work Definitions were very complicated

Early work Definitions were very complicated.

Early work Definitions were very complicated.

Ordinary relational algebra

Relational algebra Starting point: standard RA evaluate over tables of (named) records

Relational calculus Queries can also be written in a set- theoretic form { (x1,...,xn) | φ } where φ is a formula built from atomic formulas (tables), conjunction, disjunction, (negation) subject to some safety conditions to preserve finiteness of tables

Datalog Queries can also be written in a logical form called Datalog (subset of Prolog) A(x1,...,xn) :- R(y1,...,ym), ..., S(z1,...,zk) (subject to some restrictions...) Theorem: Relational algebra, relational calculus and nonrecursive Datalog are equally expressive

Example Two (equivalent) queries on a small table

Why-provenance [Buneman et al. 2001] Propagate sets of witnesses elements of {J ⊆ I | t ∈ Q(J)} Pairwise union of sets of justifications S ⋓ T = {J ∪ K | J ∈ S, K ∈ T}

Can recover by removing non-minimal Why-provenance Also sensitive to query rewriting Can recover by removing non-minimal witnesses

Where-provenance [Buneman et al. 2001] Propagate field-level annotation sets

Where-provenance May not be preserved by query equivalence

Provenance and XML Early work on provenance (why/where) focused on determinstic semistructured model Similar to (special case of) XML Advantages: XML more general; nodes easily addressed Complications: Little work on prov for XPath/XQuery, or other XML standards Next topic: provenance for updated data

Provenance for curated data

Curated databases Hi,everybody! Many bio-medical databases are curated data entered, checked manually high-quality but expensive provenance, versioning important lots of (re)implementation effort Hi,everybody!

Provenance Idea: Instead of trying to allow only "good" contributors allow anyone to contribute but record what they did Allows "auditing" after-the-fact can discard or approve changes May combine with access control allow retrospective analysis of trusted contributors

Copy-paste provenance ins copy del As data (tree) is updated, record "links" identifying "same" data in consecutive versions

Relational representation

Performance Isn't this expensive? Two optimizations: storing one edge per copied node Two optimizations: Hierarchical provenance: inheriting inferrable annotations Transactional provenance: storing only "diff" between "committed" versions, not intermediate steps

Hierarchical provenance ins copy del Infer that prov of child is child of prov Only store important (non-inferrable) edges

Transactional provenance ins copy del Require users to commit "checkpoints" (official versions) Concatenate edges between versions

Effect of optimizations Transactional Hierarchical Both

Queries Provenance queries are naturally recursive don't know how far back into history we need to look

Performance Query performance generally improves with H, T, HT storage strategy for H, this is somewhat surprising! Cheaper to recompute inferred links than to load

Generalizing to bulk updates [Buneman, Cheney & Vansummeren 2008] ICDT 2007/TODS 2008 R R' S C D 1 2 3 First, I want to give a bit of background, and describe some earlier work without going into details, to give a flavor of my approach and its motivation. In bioinformatics, there are many curated databases, which are maintained by expert scientists often by a mix of manual copying and pasting and bulk updates importing data from other sources. These scientists, called curators, also often manually record links to source data, to provide accountability and attribution. We showed how to automatically track copy-paste provenance links and store and query the provenance efficiently in a SIGMOD 2006 paper *. In later work, we generalized this approach to handle richer query and update languages *. In both approaches, each part of the output is linked to at most one "source" in the input, from which it was "copied", and we formally characterize this property. A B C 1 2 3 4 A B C 1 2 5 6 3 4 update R set (A,B) = (select S.C A, S.D B from S where S.A = 1) where R.C = 3

Database Wiki [Buneman, Cheney, Lindley, Müller, SIGMOD/SIGMOD Record 2011] Wiki-like Web application for data curation Archiving, copy- paste provenance "built-in" http://code.google.com/p/database-wiki/ Over the last two years, I have led a project to develop a practical system based on the copy-paste provenance model. The Database Wiki system is a Wiki-like Web interface for interactively editing structured data. It provides full versioning, copying and pasting from other data sources, records detailed change-history, and supports queries on both data and provenance. I obtained funding from IDEA Lab and Google for this work and it is publicly available as an open source project. * We have also explored collaboration opportunities with the IUPHAR-DB group to use DBWiki as an alternative web interface.

Provenance & annotation for XML queries

How-provenance (Green, Karvounakaris, Tannen 2007) How-provenance: shows how records were combined to form output SELECT A,B FROM R JOIN S R S A B C 1 2 a 3 b 4 c C D 1 2 x y 4 3 z A B 1 2 ax+by 3 cz

How-provenance (Green, Karvounakaris, Tannen 2007) How-provenance: shows how records were combined to form output SELECT A,B FROM R JOIN S R S A B C 1 2 a 3 b 4 c C D 1 2 x y 4 3 z A B 1 2 ax+by 3 cz

How-provenance (Green, Karvounakaris, Tannen 2007) How-provenance: shows how records were combined to form output SELECT A,B FROM R JOIN S R S A B C 1 2 a 3 b 4 c C D 1 2 x y 4 3 z A B 1 2 ax+by 3 cz

How-provenance (Green, Karvounakaris, Tannen 2007) How-provenance: shows how records were combined to form output SELECT A,B FROM R JOIN S R S A B C 1 2 a 3 b 4 c C D 1 2 x y 4 3 z A B 1 2 ax+by 3 cz

More about how-provenance Formalized using semiring-valued relations Idea: Each n-tuple in relation carries an annotation from a commutative semiring K = (K,0,1,+,*) is a commutative semiring if: (K,0,+) and (K,1,*) are commutative monoids a*0 = 0 (annihilation) a(b+c) = ab+ac (distributivity)

Some standard examples of semirings Booleans B = ({0,1},0,1,∨,⋀) Numbers N = ({0,1,...},0,1,+,∙) Free semiring ℕ[X] Polynomials over X with coefficients from N Formal addition, multiplication

Semiring-valued relational algebra I(R) is a function from tuples t to their annotations in K

Key observation When K = B, we get standard set-based semantics When K = ℕ, we get standard multiset semantics When K = ℕ[X], we get how-provenance semantics

Has Why, multiset semantics as instances How-provenance Preserves multiset, but not set semantics Has Why, multiset semantics as instances

Examples SELECT A,B FROM R JOIN S R S A B C C D A B Boolean semiring 1 2 T 3 4 C D 2 T 3 4 A B 1 2 T∧T∨T∧T 3 T∧T

Examples SELECT A,B FROM R JOIN S R S A B C C D A B Boolean semiring T 1 2 T 3 4 C D 2 T 3 4 A B 1 2 T 3

Examples SELECT A,B FROM R JOIN S R S A B C C D A B Fuzzy semiring 1 2 3 ? 4 C D 2 T 3 4 ? A B 1 2 T&T|?&T 3 T&?

Examples SELECT A,B FROM R JOIN S R S A B C C D A B Boolean semiring T 1 2 T 3 4 C D 2 T 3 4 F A B 1 2 T 3 ?

Examples SELECT A,B FROM R JOIN S R S A B C C D A B Natural numbers semiring SELECT A,B FROM R JOIN S R S A B C 1 2 3 4 C D 2 1 3 5 4 9 A B 1 2 1∙1+2∙5 3 3∙9

Examples SELECT A,B FROM R JOIN S R S A B C C D A B Natural numbers semiring SELECT A,B FROM R JOIN S R S A B C 1 2 3 4 C D 2 1 3 5 4 9 A B 1 2 11 3 27

Examples SELECT A,B FROM R JOIN S R S A B C C D A B Polynomial semiring SELECT A,B FROM R JOIN S R S A B C 1 2 a 3 b 4 c C D 2 x 3 y 4 z A B 1 2 ax+by 3 cz

One (semi) ring to rule them all The polynomial semiring is "most general" any other K-semantics is an instance SELECT A,B FROM R JOIN S R S A B C 1 2 3 4 A B C 1 2 a 3 b 4 c C D 2 1 3 5 4 9 C D 2 x 3 y 4 z A B 1 2 1∙1+2∙5 3 3∙9 A B 1 2 11 3 27 A B 1 2 ax+by 3 cz a=1,b=2,c=3 x=1,y=5,z=9

Observation SELECT A,B FROM R JOIN S R S A B C C D A B Why-provenance can be recovered as an instance of how-provenance. Idea: Take K = (P(P(X)), {}, {{}}, ⋓, U) SELECT A,B FROM R JOIN S R S A B C 1 2 {a} 3 {b} 4 {c} C D 2 {x} 3 {y} 4 {z} A B 1 2 {{a,x},{b,y}} 3 {{c,z}}

How-provenance for XML Consider unordered XQuery Evaluate over annotated (unordered) XML Each node of document has a semiring-valued annotation

<p>{$doc/*/*}</p> Example A a P b1 b2 B B C D b1c1+b2c3 b1c2 c1 c2 c3 C D D <p>{$doc/*/*}</p>

On the other hand... Semiring model is not the end of the story For example, where-provenance is not an instance of semiring model There are other non-instances. Only handles unordered XML also does not handle negation So, further generalization may be possible.

Provenance in other settings Scientific workflows/distributed computing Business process modeling Semantic Web Operating systems, file systems This work is generally not as formal not as clear what is implemented and why Understanding and relating these models is important future work

Summary of course Standards/languages for XML XPath/XQuery XSLT DTDs + XML Schema From XML to relations, and back XML shredding XML publishing

Summary of course Updates Types Provenance - today XQuery Update Updating XML stored in relations Types Regular expression types/XDuce XQuery typing, query/update independence Provenance - today

Presentations 10, 15, or 20 minutes (depending on group size) Each group member must participate Cover: background what you did (papers read, development) status; experimental results conclusions