Storing and Querying XML Documents Without Using Schema Information

Presentation transcript:

Storing and Querying XML Documents Without Using Schema Information. Kanda Runapongsa, Department of Computer Engineering, Khon Kaen University, Thailand; Jignesh M. Patel, Department of EECS, University of Michigan, USA.

Sample XML Data
<bib>
  <book>
    <author>John</author>
    <publisher>ABC</publisher>
  </book>
  <article>
    <author>Brown</author>
    <author>Smith</author>
  </article>
</bib>

Motivation. The amount of XML data is increasing rapidly: enterprise application integration, B2B interchange, Web services. Efficient tools for managing large XML data sets are urgently needed. Solutions range from native XML DBMSs to relational DBMSs.

Pros & Cons of Using an RDBMS. Advantages: well-developed query optimization techniques; effective storage and indexing mechanisms; scalability, parallelism, and distributed processing; concurrency control and recovery. Disadvantages: XML data must be transformed to and from relational data; queries require multiple joins between tables. This work tries to improve the performance of an RDBMS on XML data.

Queries on XML. Typical queries on XML are containment queries. Direct containment query: tests whether element c is a child of element p (example: p/c). Indirect containment query: tests whether element d is a descendant of element a (example: a//d). How should we process these queries?
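
To make the two query types concrete, here is a minimal sketch that evaluates one of each over the sample document from the earlier slide. It uses the limited XPath subset of Python's standard xml.etree.ElementTree rather than anything from the presentation, so the paths are written relative to the bib root.

```python
import xml.etree.ElementTree as ET

# The sample document from the "Sample XML Data" slide.
SAMPLE = (
    "<bib>"
    "<book><author>John</author><publisher>ABC</publisher></book>"
    "<article><author>Brown</author><author>Smith</author></article>"
    "</bib>"
)

bib = ET.fromstring(SAMPLE)

# Direct containment (book/author): author elements that are children of a book.
direct = [a.text for a in bib.findall("book/author")]

# Indirect containment (bib//author): author elements anywhere below bib.
indirect = [a.text for a in bib.findall(".//author")]

print(direct)    # ['John']
print(indirect)  # ['John', 'Brown', 'Smith']
```

An in-memory tree walk like this does not scale to large collections, which is the reason the slides look at relational storage instead.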

XML Query Processing (Using Schema). Use a DTD graph, which represents the structure of the DTD. Nodes correspond to elements (example: book), attributes, and operators (example: *). [Figure: DTD graph with nodes bib, *, book, article, +, author, publisher]

XML Query Processing (Using Schema). Rules for mapping nodes to tables and attributes in an RDBMS: 1. Create tables for the following nodes: (C1) nodes with no incoming link; (C2) nodes below a '*' node or a '+' node. 2. Inline all remaining nodes as table attributes. [Figure: DTD graph with nodes bib, *, book, article, +, author, publisher, and a legend marking nodes as table or attribute]

XML Query Processing (Not Using Schema). Each node is labeled (begin, end, level):
bib (1,18,1)
  book (2,9,2)
    author (3,5,3)
      John (4,4,4)
    publisher (6,8,3)
      ABC (7,7,4)
  article (10,17,2)
    author (11,13,3)
      Brown (12,12,4)
    author (14,16,3)
      Smith (15,15,4)
Containment query: d is contained in a iff a.begin < d.begin && d.end < a.end; d is directly contained in a iff, in addition, d.level = a.level + 1.
Q: book/author. A: (3,5,3), because book is (2,9,2) and 2 < 3 && 5 < 9 && 3 = 2 + 1.
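
The numbering scheme can be sketched as a single depth-first walk. This is an illustrative reconstruction, not the authors' loader: it assumes one counter that advances at every start tag, every text word, and every end tag, which reproduces the labels on the slide. The names Label, number_nodes, contains, and directly_contains are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

SAMPLE = """<bib>
  <book><author>John</author><publisher>ABC</publisher></book>
  <article><author>Brown</author><author>Smith</author></article>
</bib>"""

Label = namedtuple("Label", "term begin end level")


def number_nodes(root):
    """Return element labels in document order (assumed numbering rule)."""
    labels = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1                                  # start tag
        begin = counter
        slot = len(labels)
        labels.append(None)                           # keep document order
        counter += len((elem.text or "").split())     # one position per word
        for child in elem:
            visit(child, level + 1)
            counter += len((child.tail or "").split())
        counter += 1                                  # end tag
        labels[slot] = Label(elem.tag, begin, counter, level)

    visit(root, 1)
    return labels


def contains(a, d):
    """Indirect containment: d lies strictly inside a's interval."""
    return a.begin < d.begin and d.end < a.end


def directly_contains(a, d):
    """Direct containment: d is inside a and exactly one level deeper."""
    return contains(a, d) and d.level == a.level + 1


labels = number_nodes(ET.fromstring(SAMPLE))
bib, book, author = labels[0], labels[1], labels[2]
print(author)                           # Label(term='author', begin=3, end=5, level=3)
print(directly_contains(book, author))  # True
print(directly_contains(bib, author))   # False, author is a grandchild of bib
```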

The Proposed Solution (PAID). Use a numbering scheme (begin, end, level) to answer direct and indirect containment queries, because it applies even when there is no schema information. Store the path of each node to answer long path queries, such as a/b/c/d/e. Store the position of each node's parent to quickly establish the parent-child relationship (direct containment) between any two given nodes.

Storing Node Information in Tables
element table, showing the three <author> nodes (3,5,3), (11,13,3), (14,16,3):
docID  term    pathID  begin  end  level  parentID
1      author  3       3      5    3      2
1      author  6       11     13   3      10
1      author  6       14     16   3      10
text table, showing the word John at (4,4,4):
term  docID  wordno  level  parentID
John  1      4       4      3
path table:
pathExp              pathID
/bib/book/author     3
/bib/article/author  6
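
As a rough illustration of how these three tables could be populated, the sketch below loads the sample document into an in-memory SQLite database. The column names follow the slide; everything else is an assumption: SQLite stands in for the DB2 system used in the experiments, path IDs are assigned in order of first appearance (which happens to give 3 and 6 for the two author paths), text after a child element is ignored because the sample has none, and the load function is hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE = """<bib>
  <book><author>John</author><publisher>ABC</publisher></book>
  <article><author>Brown</author><author>Smith</author></article>
</bib>"""

# "begin", "end" and "text" are quoted because they collide with SQL words.
SCHEMA = """
CREATE TABLE element (docID INTEGER, term TEXT, pathID INTEGER,
                      "begin" INTEGER, "end" INTEGER,
                      level INTEGER, parentID INTEGER);
CREATE TABLE "text" (term TEXT, docID INTEGER, wordno INTEGER,
                     level INTEGER, parentID INTEGER);
CREATE TABLE path (pathExp TEXT, pathID INTEGER);
"""


def load(conn, doc_id, root):
    """Walk the tree once and fill element, text and path (assumed rules)."""
    path_ids = {}
    counter = 0

    def visit(elem, parent_path, parent_begin, level):
        nonlocal counter
        counter += 1
        begin = counter
        path = parent_path + "/" + elem.tag
        if path not in path_ids:
            path_ids[path] = len(path_ids) + 1
            conn.execute("INSERT INTO path VALUES (?, ?)",
                         (path, path_ids[path]))
        for word in (elem.text or "").split():
            counter += 1
            conn.execute('INSERT INTO "text" VALUES (?, ?, ?, ?, ?)',
                         (word, doc_id, counter, level + 1, begin))
        for child in elem:
            visit(child, path, begin, level + 1)
        counter += 1
        conn.execute("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?, ?)",
                     (doc_id, elem.tag, path_ids[path], begin, counter,
                      level, parent_begin))

    visit(root, "", None, 1)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load(conn, 1, ET.fromstring(SAMPLE))

authors = conn.execute(
    'SELECT docID, term, pathID, "begin", "end", level, parentID '
    'FROM element WHERE term = ? ORDER BY "begin"', ("author",)).fetchall()
john = conn.execute('SELECT * FROM "text" WHERE term = ?', ("John",)).fetchall()
paths = dict(conn.execute("SELECT pathExp, pathID FROM path").fetchall())
print(authors)
print(john)
```

The parentID column holds the begin position of the parent element, which is what the PAID query for direct containment joins on.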

Other Mapping Approaches. The BEL approach [ZND+01]: has begin, end, and level information, but no path or parentID information; stores a single word in each tuple. The BELP approach [SYU99]: has begin, end, and level information (stored as floats); has path information, but no parentID information; stores multiple words in each tuple.

SQL Queries Using PAID
book/author:
  select * from element b, element a
  where b.term = 'book' and a.term = 'author'
    and b.docID = a.docID and a.parentID = b.begin
bib//author:
  select * from element b, element a
  where b.term = 'bib' and a.term = 'author'
    and b.docID = a.docID and b.begin < a.begin and a.end < b.end
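
The two PAID queries can be tried on the sample document with a small in-memory database. This is only a sketch under stated assumptions: SQLite replaces DB2, the rows are typed in by hand from the (begin, end, level) labels shown earlier, the pathID column is left out because neither query uses it, and begin and end are quoted because they collide with SQL words.

```python
import sqlite3

# (docID, term, begin, end, level, parentID); parentID is the parent's begin.
ROWS = [
    (1, "bib", 1, 18, 1, None),
    (1, "book", 2, 9, 2, 1),
    (1, "author", 3, 5, 3, 2),
    (1, "publisher", 6, 8, 3, 2),
    (1, "article", 10, 17, 2, 1),
    (1, "author", 11, 13, 3, 10),
    (1, "author", 14, 16, 3, 10),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE element (docID, term, "begin", "end", level, parentID)')
conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?)", ROWS)

# book/author: direct containment through the stored parent position.
DIRECT = """
SELECT a."begin", a."end", a.level
FROM element b, element a
WHERE b.term = 'book' AND a.term = 'author'
  AND b.docID = a.docID AND a.parentID = b."begin"
"""

# bib//author: indirect containment through interval nesting.
INDIRECT = """
SELECT a."begin", a."end", a.level
FROM element b, element a
WHERE b.term = 'bib' AND a.term = 'author'
  AND b.docID = a.docID AND b."begin" < a."begin" AND a."end" < b."end"
ORDER BY a."begin"
"""

direct = conn.execute(DIRECT).fetchall()
indirect = conn.execute(INDIRECT).fetchall()
print(direct)    # [(3, 5, 3)]
print(indirect)  # [(3, 5, 3), (11, 13, 3), (14, 16, 3)]
```

The direct query is an equality join on parentID, while the indirect query needs two range comparisons.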

SQL Queries Using Other Approaches
BEL, book/author:
  select * from element b, element a
  where b.term = 'book' and a.term = 'author'
    and b.docID = a.docID and b.begin < a.begin and a.end < b.end
    and b.level = a.level - 1
BELP, book/author:
  select * from element e, path p
  where e.pathID = p.pathID and p.pathExp = '*book/author'
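
For comparison, the sketch below runs a BEL-style and a BELP-style version of book/author on the same sample rows. Several things here are assumptions rather than the original systems: SQLite stands in for the RDBMS, one shared element table is used for both approaches (the real BELP schema stores positions as floats), the path IDs are invented for illustration, and the slide's '*book/author' wildcard is written as a SQL LIKE pattern.

```python
import sqlite3

# (docID, term, pathID, begin, end, level); path IDs are illustrative.
ELEMENTS = [
    (1, "bib", 1, 1, 18, 1),
    (1, "book", 2, 2, 9, 2),
    (1, "author", 3, 3, 5, 3),
    (1, "publisher", 4, 6, 8, 3),
    (1, "article", 5, 10, 17, 2),
    (1, "author", 6, 11, 13, 3),
    (1, "author", 6, 14, 16, 3),
]
PATHS = [
    ("/bib", 1), ("/bib/book", 2), ("/bib/book/author", 3),
    ("/bib/book/publisher", 4), ("/bib/article", 5),
    ("/bib/article/author", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE element (docID, term, pathID, "begin", "end", level)')
conn.execute("CREATE TABLE path (pathExp, pathID)")
conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?)", ELEMENTS)
conn.executemany("INSERT INTO path VALUES (?, ?)", PATHS)

# BEL: self-join with interval nesting plus a level check.
BEL = """
SELECT a."begin", a."end", a.level
FROM element b, element a
WHERE b.term = 'book' AND a.term = 'author'
  AND b.docID = a.docID AND b."begin" < a."begin" AND a."end" < b."end"
  AND b.level = a.level - 1
"""

# BELP: look the path up in the path table (LIKE replaces the '*' wildcard).
BELP = """
SELECT e."begin", e."end", e.level
FROM element e, path p
WHERE e.pathID = p.pathID AND p.pathExp LIKE '%/book/author'
"""

bel_result = conn.execute(BEL).fetchall()
belp_result = conn.execute(BELP).fetchall()
print(bel_result)   # [(3, 5, 3)]
print(belp_result)  # [(3, 5, 3)]
```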

Experimental Setup: Platforms. Software: Apache Xerces C++ version 2.0, to parse the documents and generate the contents of the relations for each mapping approach; a commercial RDBMS, IBM DB2 UDB 7.0, with a 32 MB buffer pool. Hardware: 1.2 GHz Pentium Celeron, 256 MB memory, Windows XP.

Experimental Setup: Data Set. Data set: the Shakespeare plays XML documents. One copy of the data set is 8 MB and has 37 files of about 0.2 MB each. To obtain a larger experimental data set, eight copies of the original Shakespeare data set are used, so the total input data size is 64 MB.

Experimental Setup: Workload. 6 micro-benchmark queries covering: 'element' contains 'text'; direct containment; indirect containment. Examples:
ACT/SCENE/SPEECH/LINE[contains(STAGEDIR, 'Rising')]
/PLAY[contains(TITLE, 'Juliet')]//ACT/SCENE/SPEECH[contains(LINE, 'love')][contains(SPEAKER, 'ROMEO')]

Experimental Results: Query Execution Times (seconds)
Query  BEL     BELP    PAID
QS1    24.92   30.70   0.03
QS2    81.39   18.46   10.29
QS3    367.39  836.00  30.99
QS4    10.67   23.35   1.42
QS5    580.40  952.09  56.61
QS6    3.54    0.11    0.01
The PAID approach is the fastest on every query, from about 1.8 times faster (QS2, versus BELP) to about three orders of magnitude faster (QS1).
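
The size of the gap can be checked directly from the reported execution times. The short calculation below only divides the BEL and BELP times by the PAID time for each query; the variable names are illustrative.

```python
# Execution times in seconds, copied from the results slide.
TIMES = {
    "QS1": {"BEL": 24.92, "BELP": 30.70, "PAID": 0.03},
    "QS2": {"BEL": 81.39, "BELP": 18.46, "PAID": 10.29},
    "QS3": {"BEL": 367.39, "BELP": 836.00, "PAID": 30.99},
    "QS4": {"BEL": 10.67, "BELP": 23.35, "PAID": 1.42},
    "QS5": {"BEL": 580.40, "BELP": 952.09, "PAID": 56.61},
    "QS6": {"BEL": 3.54, "BELP": 0.11, "PAID": 0.01},
}

# Speedup of PAID over each other approach, per query.
speedup = {
    query: {name: t[name] / t["PAID"] for name in ("BEL", "BELP")}
    for query, t in TIMES.items()
}

for query, ratios in speedup.items():
    print(f"{query}: BEL/PAID = {ratios['BEL']:.1f}, "
          f"BELP/PAID = {ratios['BELP']:.1f}")

smallest = min(r for ratios in speedup.values() for r in ratios.values())
largest = max(r for ratios in speedup.values() for r in ratios.values())
print(f"smallest speedup: {smallest:.2f}, largest speedup: {largest:.0f}")
```

The ratios show that the advantage varies widely by query: it is largest for QS1 and QS6, where the PAID times are a few hundredths of a second, and smallest for QS2.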

Why Does PAID Perform Better? 1) PAID uses the parentID attribute to quickly find parent nodes. 2) PAID uses the path information to reduce the number of join operations in long path queries; when only begin, end, and level are used, the number of joins is proportional to the number of elements on the path. 3) PAID uses the index on the value attribute to quickly retrieve the nodes that satisfy value-based predicates.
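
Point 2 can be illustrated by generating the SQL for a path of arbitrary length. In this sketch, which is not the authors' query translator, the hypothetical bel_path_query helper emits one copy of the element table per step, while the path-table version needs a single join however long the path is. SQLite, the hand-entered rows, and the path IDs are assumptions, and the tag names are interpolated into the SQL only because they are trusted constants here.

```python
import sqlite3

# (docID, term, pathID, begin, end, level); path IDs are illustrative.
ELEMENTS = [
    (1, "bib", 1, 1, 18, 1), (1, "book", 2, 2, 9, 2),
    (1, "author", 3, 3, 5, 3), (1, "publisher", 4, 6, 8, 3),
    (1, "article", 5, 10, 17, 2), (1, "author", 6, 11, 13, 3),
    (1, "author", 6, 14, 16, 3),
]
PATHS = [
    ("/bib", 1), ("/bib/book", 2), ("/bib/book/author", 3),
    ("/bib/book/publisher", 4), ("/bib/article", 5),
    ("/bib/article/author", 6),
]


def bel_path_query(tags):
    """Build a begin/end/level query for tags[0]/tags[1]/.../tags[-1]."""
    aliases = [f"e{i}" for i in range(len(tags))]
    conds = [f"{a}.term = '{t}'" for a, t in zip(aliases, tags)]
    for parent, child in zip(aliases, aliases[1:]):
        conds += [
            f"{parent}.docID = {child}.docID",
            f'{parent}."begin" < {child}."begin"',
            f'{child}."end" < {parent}."end"',
            f"{parent}.level = {child}.level - 1",
        ]
    last = aliases[-1]
    tables = ", ".join(f"element {a}" for a in aliases)
    return (f'SELECT {last}."begin", {last}."end", {last}.level '
            f"FROM {tables} WHERE " + " AND ".join(conds))


# With stored paths, the whole path is matched by one lookup and one join.
PATH_QUERY = """
SELECT e."begin", e."end", e.level
FROM element e, path p
WHERE e.pathID = p.pathID AND p.pathExp = ?
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE element (docID, term, pathID, "begin", "end", level)')
conn.execute("CREATE TABLE path (pathExp, pathID)")
conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?)", ELEMENTS)
conn.executemany("INSERT INTO path VALUES (?, ?)", PATHS)

tags = ["bib", "book", "author"]
bel_sql = bel_path_query(tags)
bel_rows = conn.execute(bel_sql).fetchall()
path_rows = conn.execute(PATH_QUERY, ("/" + "/".join(tags),)).fetchall()

print(bel_sql.count("element e"), "copies of element in the BEL query")
print(bel_rows, path_rows)  # both [(3, 5, 3)]
```

A five-step path such as a/b/c/d/e would therefore need five copies of the element table in the first form and still only one element-to-path join in the second.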

Conclusions. We can use an RDBMS for storing and querying XML data. Pros: it builds on technology that has been developed over several decades (optimization, concurrency control, and recovery). Cons: it is not well tuned for containment queries. Performance on an RDBMS can be improved by encoding more information: the parent ID and the path information.

References
[ZND+01] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman, "On Supporting Containment Queries in Relational Database Management Systems", in SIGMOD 2001.
[SYU99] T. Shimura, M. Yoshikawa, and S. Uemura, "Storage and Retrieval of XML Documents Using Object-Relational Databases", in International Conference on Database and Expert Systems Applications, 1999.