XML Storage We must upgrade to XML. Everyone is talking about it. Well, that is going to cost us XXX on YYY and earn us WWW on ZZZ.

XML Topics
Previous topics:
– Motivation for XML
– XML Syntax
– DTDs
– XPath
This week: XML Storage
Upcoming weeks:
– Querying XML
– XML Search
– Advanced Topics (e.g., Web Services)

XML Storage
Suppose that we are given some XML documents. How should they be stored?
Why does it matter?
– The type of storage determines which types of use can be made of the XML efficiently
– The type of usage determines which type of storage is needed
We cannot really discuss using XML without knowing how it is stored and whether such usage is possible

3 Basic Strategies
– Files
– Relational Database
– Native XML Database
What advantages do you think each approach has?
What disadvantages do you think each approach has?

XML Files

Idea
Store XML “as is”, in a file system
– When querying, parse the document and traverse it to find the query answer
Obvious advantage: simple storage system
Obvious disadvantages:
– Must parse the XML document every time it is queried
– Does not take advantage of indexes to get quickly to “interesting” elements (to reach a given element, we must traverse everything that appears before it in the document)

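Sample Document
[XML listing not preserved in the transcript: a transaction element with an account, a buy of 100 shares and a sell of 30 shares, each with a ticker element (WEBM, GE) carrying an exch attribute]
What must we read to be able to get information about the ticker element?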

How is an XML document Parsed?
Two basic types of parsers:
– DOM parser: creates a tree out of the document
– SAX parser: does not create any data structures; notifies the program of every element seen
Both types of parsers have been standardized and have implementations in virtually every programming language

DOM Parser
DOM = Document Object Model
– The parser creates a tree object out of the document
– The user accesses data by traversing the tree
– The API allows for constructing, accessing and manipulating the structure and content of XML documents

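Document as Tree
[Figure: tree rooted at transaction, with children account, buy (attribute shares = 100) and sell (attribute shares = 30); buy and sell each have a ticker child with an exch attribute; ticker text values WEBM and GE; exch values NASDAQ and NYSE]
Methods like: getRoot, getChildren, getAttributes, etc.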

Advantages and Disadvantages
How would you answer a query like:
– /transaction/buy
– //ticker
Advantages:
– Natural and relatively easy to use
– Can repeatedly query the tree without reparsing
Disadvantages:
– High memory requirements: the whole document is kept in memory
– Must parse the whole document and construct many objects before use
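The two DOM queries above can be sketched with Python's standard xml.dom.minidom module. This is only an illustration: the document below is an assumed reconstruction of the sample document (the account text ACCOUNT-ID and the pairing of exchanges with tickers are placeholders, not taken from the slides).

```python
from xml.dom.minidom import parseString

# Assumed reconstruction of the sample document; ACCOUNT-ID and the
# exchange/ticker pairing are placeholders for illustration only.
DOC = """<transaction>
  <account>ACCOUNT-ID</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

dom = parseString(DOC)            # the whole document becomes an in-memory tree
root = dom.documentElement        # the transaction element

# /transaction/buy: inspect only the children of the root element
buys = [n for n in root.childNodes
        if n.nodeType == n.ELEMENT_NODE and n.tagName == "buy"]

# //ticker: search the entire tree for ticker elements
tickers = [t.firstChild.data for t in dom.getElementsByTagName("ticker")]

print([b.getAttribute("shares") for b in buys])   # ['100']
print(tickers)                                    # ['WEBM', 'GE']
```

Both queries run against the same tree, so the document is parsed once, at the cost of holding all of it in memory.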

SAX Parser
SAX = Simple API for XML
– The parser creates “events” (i.e., notifications) while reading the document
– Goes through the document one time only

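Document as Events
[The sample document is shown next to the events reported for it; only the text values WEBM and GE survive from the listing]
Start tag: transaction
Start tag: account
Text: (value not preserved in the transcript)
End tag: account
Start tag: buy
Attribute: shares Value: 100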

Advantages and Disadvantages
How would you answer a query like:
– /transaction/buy
– find accounts in which something is bought or sold from the NASDAQ
Advantages:
– Requires less memory
– Fast
Disadvantages:
– Cannot read backwards
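The NASDAQ query can be sketched with Python's standard xml.sax module. The document is again an assumed reconstruction with a placeholder account value, and the handler class name is hypothetical. Because a SAX program cannot read backwards, the handler has to remember the account text when it passes by, before any ticker is seen.

```python
import xml.sax

# Assumed reconstruction of the sample document; ACCOUNT-ID is a placeholder.
DOC = b"""<transaction>
  <account>ACCOUNT-ID</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

class NasdaqAccounts(xml.sax.ContentHandler):
    """Collects the account of every ticker traded on the NASDAQ."""

    def __init__(self):
        super().__init__()
        self.in_account = False
        self.account = ""      # last account text seen so far
        self.matches = []

    def startElement(self, name, attrs):
        if name == "account":
            self.in_account = True
            self.account = ""
        elif name == "ticker" and attrs.get("exch") == "NASDAQ":
            self.matches.append(self.account)

    def characters(self, content):
        if self.in_account:    # text may arrive in several chunks
            self.account += content

    def endElement(self, name):
        if name == "account":
            self.in_account = False

handler = NasdaqAccounts()
xml.sax.parseString(DOC, handler)   # single pass over the document
print(handler.matches)              # ['ACCOUNT-ID']
```

This works only because the account element precedes the tickers in document order; if it came after them, the handler would have to buffer the tickers instead.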

Storing XML in a Relational Database

Why?
Relational databases have been developed for about 30 years
There is extensive knowledge on how to use them efficiently
Why not take advantage of this knowledge?
Main challenges:
– get XML into the database (inserting)
– get XML out of the database (querying)

Reminder
A relational database simply contains some tables
Each table can have any number of columns (also called attributes)
Data items in each column are atomic, i.e., single values
A schema is a description of a set of tables, i.e., each table's name and its column names

Difficulties
DTDs can be complex
Modeling mismatch:
– Conceptually, relational databases have 2 levels: tables and attributes
– XML documents have arbitrary nesting
XML documents can have set-valued attributes and recursion

Difficulties
[Figure: an XML translation layer on top of a relational database system. Labels: DTD, relational schema, translation information; XML documents, tuples; XML query, SQL query; relational result, XML result]

Relational Databases: Option 1 The Schema-less Case

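Option 1: Store Tree Structure
[Figure: an XML document for a person, with a name (Bart Simpson) and telephone numbers beginning with 02 (remaining digits not preserved), shown next to its tree: a person node with name and tel children]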

Option 1: Store Tree Structure (cont.)
1. Assign each node a unique id
2. For each node, store type and value
3. For each node, store parent information
[Figure: the person tree from the previous slide]

Option 1: Store Tree Structure (cont.)
[Figure: the person tree, with numbered nodes]
Node  Type     Value         ParentID
1     element  person        null
6     text     Bart Simpson  2
…     …        …             …
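The three steps of this schema-less mapping can be sketched with Python's standard xml.etree.ElementTree and sqlite3 modules. The function name shred, the table and column names, and the telephone values TEL-1 and TEL-2 are assumptions made for the illustration. Ids are assigned in document order here, so they need not match the numbering on the slide, and attributes and mixed content are ignored for brevity.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical person document; TEL-1 and TEL-2 stand in for the phone numbers.
DOC = "<person><name>Bart Simpson</name><tel>TEL-1</tel><tel>TEL-2</tel></person>"

def shred(elem, parent_id, rows):
    """Append one row per element and per text node: (id, type, value, parent)."""
    node_id = len(rows) + 1                      # step 1: unique id
    rows.append((node_id, "element", elem.tag, parent_id))   # steps 2 and 3
    if elem.text and elem.text.strip():
        rows.append((len(rows) + 1, "text", elem.text.strip(), node_id))
    for child in elem:
        shred(child, node_id, rows)

rows = []
shred(ET.fromstring(DOC), None, rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, "
           "value TEXT, parent_id INTEGER)")
db.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)

for row in db.execute("SELECT * FROM node ORDER BY id"):
    print(row)
```

The same table definition works for any document, which is the main attraction of this option.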

How Good Is This?
Simple schema, can work with any document
Translation from XML to tables is easy
What about the translation back?
– Is this transformation lossless?

Answering XPath Queries
Can you answer an XPath query that:
– Uses just the child axis, e.g., /a/b/c/d/e
– Uses the descendant axis at the beginning of the query, e.g., //a/b
– Uses the descendant axis in the middle of the query, e.g., /a/b//e
– Uses the following, preceding, or following-sibling axis?

Solving the Problem
With the current modeling, it is not possible to evaluate many types of XPath steps
To solve this problem, we:
– number the nodes in DFS order
– store, for each node, the id of its last descendant

[Figure: the person tree, now with a phones element grouping the tel elements]
Node  Type     Value   ParentID  LastDesc
1     element  person  null      10
4     element  phones  1         8
…     …        …       …         …
Can you answer these queries now?
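A minimal sketch of why the extra column helps, assuming a smaller hypothetical document than the one on the slide (so the LastDesc of person differs). With DFS numbering, the descendants of a node are exactly the ids in the range (id, LastDesc], and the following and preceding axes also reduce to simple comparisons. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; TEL-1 and TEL-2 stand in for the phone numbers.
DOC = ("<person><name>Bart Simpson</name>"
       "<phones><tel>TEL-1</tel><tel>TEL-2</tel></phones></person>")

def number(elem, parent_id, rows):
    """Rows are [id, type, value, parent_id, last_desc], numbered in DFS order."""
    node_id = len(rows) + 1
    row = [node_id, "element", elem.tag, parent_id, node_id]
    rows.append(row)
    if elem.text and elem.text.strip():
        text_id = len(rows) + 1
        rows.append([text_id, "text", elem.text.strip(), node_id, text_id])
    for child in elem:
        number(child, node_id, rows)
    row[4] = len(rows)          # id of the last node inside this subtree
    return rows

rows = number(ET.fromstring(DOC), None, [])

def descendants(rows, node_id):
    last = rows[node_id - 1][4]
    return [r[0] for r in rows if node_id < r[0] <= last]

def following(rows, node_id):
    last = rows[node_id - 1][4]
    return [r[0] for r in rows if r[0] > last]

def preceding(rows, node_id):
    # nodes that end before node_id starts; ancestors are excluded automatically
    return [r[0] for r in rows if r[4] < node_id]

print(rows[3])                  # [4, 'element', 'phones', 1, 8]
print(descendants(rows, 4))     # [5, 6, 7, 8]
print(following(rows, 2))       # [4, 5, 6, 7, 8]
print(preceding(rows, 4))       # [2, 3]
```

In SQL each of these becomes a range condition on id and LastDesc, so a descendant step needs a single join instead of an unknown number of parent-child joins.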

Summary: Main Problems
No convenient method for creating XML as output
Each element in the path expression requires an additional join
– Can become very expensive
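The join cost can be made concrete with a small sketch that translates a child-axis-only path into SQL over the node table. The table layout, the hand-written rows and the function name child_path_sql are assumptions for the illustration; the point is only that a path with k steps produces k - 1 self-joins.

```python
import sqlite3

def child_path_sql(path):
    """Build SQL for a child-axis-only path such as /person/phones/tel."""
    steps = path.strip("/").split("/")
    sql = [f"SELECT n{len(steps) - 1}.id FROM node n0"]
    for i in range(1, len(steps)):
        # one additional self-join per additional step
        sql.append(f"JOIN node n{i} ON n{i}.parent_id = n{i - 1}.id")
    conds = ["n0.parent_id IS NULL"]
    conds += [f"n{i}.type = 'element' AND n{i}.value = ?" for i in range(len(steps))]
    sql.append("WHERE " + " AND ".join(conds))
    sql.append(f"ORDER BY n{len(steps) - 1}.id")
    return " ".join(sql), steps

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, "
           "value TEXT, parent_id INTEGER)")
db.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", [
    (1, "element", "person", None),
    (2, "element", "name", 1),
    (3, "text", "Bart Simpson", 2),
    (4, "element", "phones", 1),
    (5, "element", "tel", 4),
    (6, "text", "TEL-1", 5),      # placeholder phone values
    (7, "element", "tel", 4),
    (8, "text", "TEL-2", 7),
])

sql, params = child_path_sql("/person/phones/tel")
ids = [row[0] for row in db.execute(sql, params)]
print(sql)
print(ids)    # [5, 7]
```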

Relational Databases: Option 2, Taking Advantage of DTDs Based On: Relational Databases for Querying XML Documents: Limitations and Opportunities By: Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton

Example XML
[XML listing not preserved; surviving values: The Selfish Gene, Richard Dawkins, Timbuktu]
Wouldn’t it be nice to store this as a table with the columns:
booktitle, author_id, firstname, lastname, city, zip

Example XML
[XML listing not preserved; surviving values: The Selfish Gene, Richard Dawkins, Timbuktu]
We can do this only if all XML documents that we will be considering follow this format. Otherwise, for example, what happens if there are 2 authors?

Considering the DTD
If a DTD is given, then it defines what types of XML documents will be of interest
Challenge: given a DTD, find a relational schema such that ANY document conforming to the DTD can be stored in the relations

Reducing the Complexity
DTDs can be very complex
Before translating a DTD to a relational schema, simplify the DTD
Property of the simplification: if D2 is a simplification of D1, then every document that conforms to D1 also almost conforms to D2
– “almost” means that it conforms if the ordering of sub-elements is ignored

Simplification Rules
(e1, e2)* → e1*, e2*
(e1, e2)? → e1?, e2?
(e1 | e2) → e1?, e2?
e1** → e1*
e1*? → e1*
e1?* → e1*
e1?? → e1?
e1+ → e1*
..., a*, ..., a*, ... → a*, ...
..., a*, ..., a?, ... → a*, ...
..., a?, ..., a*, ... → a*, ...
..., a?, ..., a?, ... → a*, ...
..., a, ..., a, ... → a*, ...

Simplification Rules (example)
(b|c|e)?,(e?|f+)
→ (b?,c?,e?)?,e??,f+?
→ b??,c??,e??,e??,f+?
→ b??,c??,e??,e??,f*?
→ b?,c?,e?,e?,f*
→ b?,c?,e*,f*
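The end result of the simplification can also be computed mechanically. The sketch below does not apply the rewrite rules one at a time; as an assumed shortcut, it parses the content model and records for each element name whether it ends up optional or repeated, which yields the same normal form for the example above. All function names are illustrative, and the parser handles only names, parentheses and the operators , | ? * +.

```python
import re

def parse(model):
    """Parse a DTD content model into nested tuples."""
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[()|,?*+]", model)
    pos = 0

    def item():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = group()
            pos += 1                       # skip the closing ")"
        else:
            node = ("name", tok)
        while pos < len(tokens) and tokens[pos] in ("?", "*", "+"):
            node = ("quant", tokens[pos], node)
            pos += 1
        return node

    def group():
        nonlocal pos
        kind, parts = "seq", [item()]
        while pos < len(tokens) and tokens[pos] in (",", "|"):
            kind = "alt" if tokens[pos] == "|" else "seq"
            pos += 1
            parts.append(item())
        return (kind, parts)

    return group()

def flatten(node, optional, repeated, out):
    """Push quantifiers down to the names, as the distribution rules do."""
    kind = node[0]
    if kind == "name":
        out.append((node[1], optional, repeated))
    elif kind == "quant":
        # ?, * and + all leave the operand optional (e+ is simplified to e*)
        flatten(node[2], True, repeated or node[1] in ("*", "+"), out)
    else:
        for part in node[1]:
            # every branch of a choice becomes optional
            flatten(part, optional or kind == "alt", repeated, out)

def simplify(model):
    occurrences = []
    flatten(parse(model), False, False, occurrences)
    quant = {}
    for name, optional, repeated in occurrences:
        if name in quant:
            quant[name] = "*"              # a name seen twice collapses to a*
        else:
            quant[name] = "*" if repeated else "?" if optional else ""
    return ",".join(name + q for name, q in quant.items())

print(simplify("(b|c|e)?,(e?|f+)"))        # b?,c?,e*,f*
```

Elements are listed in order of first occurrence, which is harmless because the simplified DTD ignores sub-element ordering anyway.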

You try it
Can you simplify the following expression using the simplification rules?
– (b|c|e)?,(e?|(f?,(b,b)*))*

DTD Graphs
In order to describe a technique for converting a DTD to a schema, it is convenient to first describe DTDs (or rather, simplified DTDs) as graphs
– The nodes of the graph are the elements, attributes and operators in the DTD
– Each element appears exactly once in the graph
– Attributes and operators appear as many times as they appear in the DTD
– Cycles indicate recursion

DTD Example

Corresponding DTD Graph

Creating the Schema: Shared Inline Technique
When creating the schema for a DTD, we create a relation for:
– each element with in-degree greater than 1
– each element with in-degree 0
– each element below a *
– one element from each set of mutually recursive elements, having in-degree 1
All other elements are “inlined” into their parent’s relation (i.e., added to their parent’s relation)
– Note that the parent may itself be inlined
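The four rules can be sketched as a small graph computation. The DTD graph from the slides is not preserved in the transcript, so the edge list below is an assumption, chosen to be consistent with the relations listed for the Shared technique; each edge is (parent, child, below_star), with the * operator folded into a flag. The function name shared_relations is illustrative.

```python
from collections import defaultdict

# Assumed DTD graph, for illustration only: (parent, child, child is below a *)
EDGES = [
    ("book", "booktitle", False),
    ("book", "author", False),
    ("article", "title", False),
    ("article", "author", True),
    ("article", "contactauthor", False),
    ("monograph", "title", False),
    ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True),
    ("author", "name", False),
    ("author", "address", False),
    ("name", "firstname", False),
    ("name", "lastname", False),
]

def shared_relations(edges):
    """Return the elements that get their own relation under Shared inlining."""
    children = defaultdict(list)
    in_degree = defaultdict(int)
    starred, elements = set(), set()
    for parent, child, below_star in edges:
        children[parent].append(child)
        in_degree[child] += 1
        elements.update((parent, child))
        if below_star:
            starred.add(child)

    # rules 1-3: in-degree > 1, in-degree 0, or below a *
    relations = {e for e in elements
                 if in_degree[e] == 0 or in_degree[e] > 1 or e in starred}

    def on_inlined_cycle(start):
        """True if start can reach itself through inlined elements only."""
        stack, seen = list(children[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node in relations or node in seen:
                continue
            seen.add(node)
            stack.extend(children[node])
        return False

    # rule 4: break every remaining cycle by giving one element a relation
    for elem in sorted(elements):
        if elem not in relations and on_inlined_cycle(elem):
            relations.add(elem)
    return relations

print(sorted(shared_relations(EDGES)))
# ['article', 'author', 'book', 'monograph', 'title']
```

Every element not in the returned set (for example name, firstname or editor in this assumed graph) would be inlined into the relation of its nearest ancestor that has one.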

Relations for which elements?

book (bookID: integer, book.booktitle: string)
article (articleID: integer, article.contactauthor.authorid: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.name: string)
title (titleID: integer, title: string, title.parentID: integer, title.parentCODE: integer)
author (author.parentID: integer, author.parentCODE: integer, authorID: integer, author.authorid: string, author.address: string, author.name.firstname: string, author.name.lastname: string)
What are these for? (the question refers to columns highlighted on the original slide)

Advantages/Disadvantages
Advantages:
– Reduces the number of joins for queries like “get the first and last names of an author”
– Efficient for queries such as “list all authors with name Jack”
Disadvantages:
– Extra join needed for “article with a given title name”

Notes
Can/should we use foreign keys to connect child tuples with their parents, e.g., titles with what they belong to?
How can we answer queries such as:
– //title
– //article/title
– //article//name

Another Option: Hybrid Inlining Technique Same as Shared, except also inline elements with in-degree greater than one for the places in which they are not recursive or reached through a * node

What, in addition, will be inline?

book (bookID: integer, book.booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
article (articleID: integer, article.contactauthor.authorid: string, article.title: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string, monograph.editor.name: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
Why do we still have an author relation?

Advantages/Disadvantages
Advantages:
– Reduces joins through shared elements (that are not set-valued or recursive elements)
– Reduces joins for queries like “get first and last names of a book author” (like Shared)
Disadvantages:
– Requires more SQL sub-queries (i.e., unions) to retrieve all authors with first name Jack
Tradeoff between reducing the number of unions and reducing the number of joins: Shared and Hybrid target union-reduction and join-reduction, respectively

XML in Major Databases
All major databases now have some level of support for XML
Example: Oracle
– XML data type (a column can contain XML documents)
– XPath processing of XML values
– Some indexing capabilities
– XML is a second-class citizen in the database (support consists of a collection of tools, with no coherent framework)