XML Query Processing Yaw-Huei Chen


XML Query Processing Yaw-Huei Chen Department of Computer Science and Information Engineering National Chiayi University

Outline
  Introduction to XML
  Query Languages
  Indexing
  Query Processing
  Incremental Cache Maintenance
  Testing Reachability
  Conclusions
5/12/2006

From Documents to Data
HTML describes presentation:
<h1>References</h1>
<p>S. Abiteboul, P. Buneman, D. Suciu, <i>Data On The Web</i>, 2000.</p>

From Documents to Data (cont.)
XML (eXtensible Markup Language) describes content:
<references>
  <book>
    <author>S. Abiteboul</author>
    <author>P. Buneman</author>
    <author>D. Suciu</author>
    <title>Data On The Web</title>
    <year>2000</year>
  </book>
</references>

XML Syntax
Element
  a piece of text bounded by matching tags
  elements can be nested
  <author>D. Suciu</author>
Attribute
  unordered; each is associated with an element node and has a name and a value
  an alternative way to represent data
  <book price="50" currency="USD">… </book>
An XML document has a single root element
Well-formed XML documents
  tags must nest properly
  attribute names must be unique within an element
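The well-formedness rules above can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for mis-nested tags, duplicate attributes, multiple roots, etc.
        return False

# Hypothetical sample inputs, one per rule on the slide.
samples = {
    "nested elements": '<book price="50" currency="USD"><author>D. Suciu</author></book>',
    "tags not nested properly": "<a><b></a></b>",
    "duplicate attribute": '<book price="50" price="60"/>',
    "two root elements": "<a/><b/>",
}

for description, text in samples.items():
    print(description, "->", is_well_formed(text))
```

Only the first sample is accepted; the other three each violate one of the rules listed on the slide.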

XML Hierarchical Data Model
[Figure: tree rooted at references; a book node has author children (S. Abiteboul, P. Buneman, D. Suciu), a title (Data on the Web), and a year (2000); further content is elided as "…"]
XML is ordered

Specifying the Structure
DTD (Document Type Definition): a context-free grammar
<!DOCTYPE references [
  <!ELEMENT references (book+)>
  <!ELEMENT book (author*, title, year?)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
]>
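Python's standard library does not validate against DTDs, so the sketch below only illustrates the idea: each content model from the DTD above is translated by hand into a regular expression over the sequence of child tag names. The `CONTENT_MODELS` table and the `is_valid` helper are assumptions made for this illustration, not a general DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models; every child tag is followed by a comma
# so that tag names cannot run together when matched.
CONTENT_MODELS = {
    "references": r"(book,)+",                   # (book+)
    "book": r"(author,)*(title,)(year,)?",       # (author*, title, year?)
    "author": r"",                               # (#PCDATA): no child elements
    "title": r"",
    "year": r"",
}

def is_valid(elem):
    """Check elem and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False                             # undeclared element
    child_tags = "".join(child.tag + "," for child in elem)
    if re.fullmatch(model, child_tags) is None:
        return False
    return all(is_valid(child) for child in elem)

good = ET.fromstring(
    "<references><book><author>D. Suciu</author>"
    "<title>Data On The Web</title><year>2000</year></book></references>"
)
bad_order = ET.fromstring(
    "<references><book><year>2000</year><title>Data On The Web</title></book></references>"
)
no_title = ET.fromstring("<references><book><author>D. Suciu</author></book></references>")

print(is_valid(good), is_valid(bad_order), is_valid(no_title))
```

Because XML is ordered, the document with year before title is rejected even though it contains the right elements.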

Specifying the Structure (cont.)
XML Schema
  in XML format
  element names and types associated locally
  includes primitive data types
  a superset of DTDs
Valid XML documents
  the document must be well-formed
  the element names must follow the structure specified in a DTD file or an XML schema file

Storing XML Documents
  Designing a specialized system for storing native XML data
  Using a DBMS to store whole XML documents as text fields
  Using a DBMS to store the document contents as data elements
The storage must support XML's ordered data model

XPath
Uses path expressions to select nodes or node-sets
  Single slash (/): a direct child
  Double slash (//): a descendant at any level
/references selects the root element references
//book selects all book elements
references//book selects all book elements that are descendants of the references element
/references/* selects all the child elements of the references element
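The path expressions above can be tried out with Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and evaluates paths relative to an element. In this sketch the paths are therefore written relative to the root element `references`; the sample document is assembled from the earlier slides.

```python
import xml.etree.ElementTree as ET

XML = """<references>
  <book price="50" currency="USD">
    <author>S. Abiteboul</author>
    <author>P. Buneman</author>
    <author>D. Suciu</author>
    <title>Data On The Web</title>
    <year>2000</year>
  </book>
</references>"""

root = ET.fromstring(XML)            # corresponds to /references

# //book : book elements at any depth below the root
books = root.findall(".//book")

# /references/* : all element children of the root
children = root.findall("./*")

# //book/author : direct author children of any book
authors = [a.text for a in root.findall(".//book/author")]

print(root.tag, len(books), [c.tag for c in children], authors)
```

A full XPath engine would also accept the absolute forms shown on the slide; the relative forms used here are a limitation of this particular library.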

XQuery
XQuery uses XPath expressions, but has additional constructs.
FLWR stands for the four main clauses of XQuery:
  FOR <variable bindings to individual nodes (elements)>
  LET <variable bindings to collections of nodes (elements)>
  WHERE <qualifier conditions>
  RETURN <query result specification>
For example:
  for $b in doc("references.xml")//book
  where count($b/author) > 0
  return <book> { $b/title } { for $a in $b/author return $a } </book>
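To make the semantics of the example query concrete, here is a rough Python equivalent: iterate over every book (FOR), keep those with at least one author (WHERE), and build a new book element holding the title followed by the authors (RETURN). It is a sketch only; the function name and the inline sample document (standing in for references.xml) are assumptions.

```python
import xml.etree.ElementTree as ET

def books_with_authors(root):
    """Mimic the FLWR example: return rebuilt <book> elements."""
    results = []
    for b in root.iter("book"):                  # for $b in //book
        authors = b.findall("author")            # $b/author
        if len(authors) > 0:                     # where count($b/author) > 0
            out = ET.Element("book")             # return <book> ... </book>
            out.extend(b.findall("title"))       # { $b/title }
            out.extend(authors)                  # { for $a in $b/author return $a }
            results.append(out)
    return results

# Hypothetical stand-in for doc("references.xml").
doc = ET.fromstring(
    "<references>"
    "<book><author>S. Abiteboul</author><author>D. Suciu</author>"
    "<title>Data On The Web</title><year>2000</year></book>"
    "<book><title>Anonymous Notes</title></book>"
    "</references>"
)

result = books_with_authors(doc)
for book in result:
    print(ET.tostring(book, encoding="unicode"))
```

The second book has no author, so the WHERE condition filters it out, and the year is dropped because the RETURN clause does not mention it.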

Indexing
Efficient mechanisms are needed for determining the ancestor-descendant relationship between XML elements
Two types of indexes can help
  Structural index: reduces the time for traversing the XML hierarchy.
  Numbering scheme: encodes each element by its positional information within the XML hierarchy.

Structural Index
DataGuides [Goldman97]: every label path of the source graph has exactly one data path instance in its DataGuide.
[Figure: example graphs with nodes labeled A, B, C, D]
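One common way to obtain a structure with this property is a subset construction over the data graph, in the same spirit as converting an NFA to a DFA: each DataGuide node stands for the set of data nodes reached by one label path. The sketch below assumes that construction; the graph encoding, the function names, and the sample graph are all hypothetical and do not reproduce the figure on the slide.

```python
from collections import deque

def build_dataguide(edges, root):
    """edges: dict node -> list of (label, child).
    Returns (start, transitions); each state is a frozenset of data nodes."""
    start = frozenset([root])
    transitions = {}
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        by_label = {}
        for node in state:
            for label, child in edges.get(node, []):
                by_label.setdefault(label, set()).add(child)
        transitions[state] = {}
        for label, targets in by_label.items():
            target = frozenset(targets)
            transitions[state][label] = target   # exactly one edge per label
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start, transitions

def lookup(start, transitions, path):
    """Follow a label path; return the data nodes it reaches (possibly none)."""
    state = start
    for label in path:
        state = transitions.get(state, {}).get(label)
        if state is None:
            return frozenset()
    return state

# Hypothetical data graph: node 0 has two A children with overlapping structure.
edges = {0: [("A", 1), ("A", 2)], 1: [("B", 3)], 2: [("B", 4), ("C", 5)]}
start, transitions = build_dataguide(edges, 0)
print(sorted(lookup(start, transitions, ["A", "B"])))
```

The label path A/B occurs twice in the data but only once in the guide, so a path query can be answered by a single walk instead of a traversal of the whole hierarchy.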

Structural Index (cont.)
1-Index [Milo99]: groups nodes together if they have the same set of incoming paths
[Figure: three panels over nodes labeled A, B, C, D: data graph, 1-index, dataguide]

Structural Index (cont.)
Covering indexes [Kaushik02]
Forward and Backward Index (F&B-Index)
  Add inverse edges to the graph
  Compute the 1-index (or DataGuide) for the modified graph
The F&B-Index is too large. To reduce its size:
  index only useful tags
  do not index all idref edges (XPath gives a higher priority to tree edges, and // matches only tree edges)
  exploit local similarity (short paths only)
  restrict tree depth

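Numbering Scheme
Dewey Decimal Coding [Tatarinov02]
[Figure: references is labeled 1 and has two book children labeled 1.1 and 1.2; the first book's children (author, author, title, year) are labeled 1.1.1 to 1.1.4; the second book's children (author, title) are labeled 1.2.1 and 1.2.2]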

Numbering Scheme (cont.)
Inserting new elements
[Figure: the Dewey-labeled tree (references, book, author, title, year) after an insertion, with labels 1, 1.1, 1.2, 1.1.1 to 1.1.5, 1.2.1, 1.2.2; legend: new element, nodes that require renumbering]
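A small sketch of Dewey labeling follows. Each label is the parent's label extended by the child's 1-based position, so the ancestor test is a prefix test, and an insertion forces the following siblings and their subtrees to be renumbered. The tuple-based tree encoding, the helper names, and the insertion position (third child of the first book) are assumptions chosen for illustration.

```python
def dewey_labels(tree, label=(1,)):
    """tree: (name, [children]). Yield (label, name) in document order."""
    name, children = tree
    yield label, name
    for position, child in enumerate(children, start=1):
        yield from dewey_labels(child, label + (position,))

def is_ancestor(a, d):
    """a is a proper ancestor of d iff a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

def needs_renumbering(labels, parent, position):
    """Labels that change when a new child is inserted at `position` under `parent`:
    siblings at or after that position, together with their descendants."""
    k = len(parent)
    return [l for l in labels if len(l) > k and l[:k] == parent and l[k] >= position]

tree = ("references", [
    ("book", [("author", []), ("author", []), ("title", []), ("year", [])]),
    ("book", [("author", []), ("title", [])]),
])

labeled = list(dewey_labels(tree))
labels = [label for label, _ in labeled]
print([".".join(map(str, l)) for l in labels])
print(needs_renumbering(labels, (1, 1), 3))
```

Inserting a third child under the first book leaves 1.1.1 and 1.1.2 untouched but shifts 1.1.3 and 1.1.4, which is the renumbering cost the slide points out.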

Numbering Scheme (cont.)
Preorder and postorder [Dietz82]
(preorder, postorder): x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.
[Figure: tree over references, book, author, title, year labeled as follows: references (1,10); first book (2,6) with children (3,1), (4,2), (5,3), (6,4), (7,5); second book (8,9) with children (9,7), (10,8)]
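The (preorder, postorder) labels can be produced in a single depth-first traversal. The sketch below assumes a tuple-based tree encoding and a tree shape (five children under the first book, two under the second) inferred from the labels listed on the slide; the function names are hypothetical.

```python
def pre_post(tree):
    """tree: (name, [children]). Return [(name, preorder, postorder)] in document order."""
    out = []
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        counters["pre"] += 1                 # numbered on entry
        entry = [name, counters["pre"], None]
        out.append(entry)
        for child in children:
            visit(child)
        counters["post"] += 1                # numbered on exit
        entry[2] = counters["post"]

    visit(tree)
    return [tuple(e) for e in out]

def is_ancestor(x, y):
    """x, y: (preorder, postorder). x precedes y in preorder and follows it in postorder."""
    return x[0] < y[0] and x[1] > y[1]

tree = ("references", [
    ("book", [("author", []), ("author", []), ("author", []), ("title", []), ("year", [])]),
    ("book", [("author", []), ("title", [])]),
])

labels = pre_post(tree)
for name, pre, post in labels:
    print(name, (pre, post))
```

With this tree the root receives (1,10) and the two books receive (2,6) and (8,9), in agreement with the slide; the ancestor test needs only two integer comparisons.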

Numbering Scheme (cont.)
Various interval schemes
(docno, begin:end, level) [Zhang01]
  The begin and end positions are generated by a depth-first traversal of the tree, assigning the next sequential number at each visit.
(preorder, size) [Li01]
  Size is an arbitrary integer larger than the total number of the current descendants.
(lowest_post, postorder) [Agrawal89]
  Lowest_post is the lowest postorder number of its descendants.
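As an illustration of the (preorder, size) idea, the sketch below gives every node a size that exceeds what its current descendants need by a fixed slack, so unused numbers remain inside each interval. The slack value, the way child intervals are packed, and the names `order_size` and `contains` are assumptions for this sketch rather than the exact scheme of [Li01].

```python
def order_size(tree, slack=2):
    """tree: (name, [children]). Return [(name, order, size)] in document order.
    A node with label (order, size) owns the numbers order+1 .. order+size."""
    def size_of(node):
        # Room for every child interval plus `slack` spare numbers.
        # Recomputed on each call for brevity; a real implementation would cache it.
        _, children = node
        return sum(size_of(c) + 1 for c in children) + slack

    out = []

    def assign(node, order):
        name, children = node
        out.append((name, order, size_of(node)))
        next_order = order + 1
        for child in children:
            assign(child, next_order)
            next_order += size_of(child) + 1     # skip past the child's interval

    assign(tree, 1)
    return out

def contains(a, d):
    """a, d: (order, size). True iff a is a proper ancestor of d."""
    return a[0] < d[0] <= a[0] + a[1]

tree = ("a", [("b", [("c", [])]), ("d", [])])
labels = order_size(tree)
print(labels)
```

Because each size is larger than strictly necessary, some later insertions can take a free number inside the parent's interval without renumbering existing nodes, which is the motivation for allowing size to be arbitrary.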

Query Processing
Goal: find all occurrences of a query pattern in the XML documents.
Navigation-based approach
  Analyzes the input document one tag at a time.
  The query is represented as a non-deterministic finite automaton (NFA) [Diao03]
Index-based approach
  Uses pre-computed indexes to answer the query

Holistic Twig Join [Bruno02]
Indexes
  string: (doc, left, level)
  element: (doc, left:right, level)
Query: A//B//C
[Figure: data elements A1, B1, A2, B2, C1; stack encoding with stack SA holding A1, A2, stack SB holding B1, B2, and stack SC holding C1]
Query results: A1 B1 C1; A1 B2 C1; A2 B2 C1
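The stack encoding for a path query such as A//B//C can be sketched as follows. This is a simplified illustration in the style of a stack-based path join, not the full twig algorithm of [Bruno02]: it assumes a single document, only ancestor-descendant edges, distinct tags in the query, and elements given as (name, left, right). The interval numbers are invented so that A1 contains B1, which contains A2, then B2, then C1.

```python
def path_stack(query, streams):
    """query: list of tags, root first. streams: dict tag -> [(name, left, right)]
    sorted by left. Returns matches as tuples of element names."""
    stacks = [[] for _ in query]       # entries: (element, index of top of parent stack)
    cursors = [0] * len(query)
    results = []

    def expand(level, top):
        # Every entry at or below `top` is an ancestor of the element that pointed here.
        for i in range(top + 1):
            elem, parent_top = stacks[level][i]
            if level == 0:
                yield (elem[0],)
            else:
                for prefix in expand(level - 1, parent_top):
                    yield prefix + (elem[0],)

    while True:
        pending = [q for q in range(len(query)) if cursors[q] < len(streams[query[q]])]
        if not pending:
            break
        # Take the next element in document order across all streams.
        q = min(pending, key=lambda i: streams[query[i]][cursors[i]][1])
        elem = streams[query[q]][cursors[q]]
        cursors[q] += 1
        for stack in stacks:           # drop elements that ended before elem starts
            while stack and stack[-1][0][2] < elem[1]:
                stack.pop()
        if q == 0 or stacks[q - 1]:    # push only if an ancestor candidate exists
            parent_top = len(stacks[q - 1]) - 1 if q > 0 else -1
            stacks[q].append((elem, parent_top))
            if q == len(query) - 1:    # leaf: output matches, then discard it
                results.extend(expand(q, len(stacks[q]) - 1))
                stacks[q].pop()
    return results

streams = {
    "A": [("A1", 1, 10), ("A2", 3, 8)],
    "B": [("B1", 2, 9), ("B2", 4, 7)],
    "C": [("C1", 5, 6)],
}
matches = path_stack(["A", "B", "C"], streams)
print(matches)
```

The stacks store the three answers compactly: B2 points at the top of SA (so it pairs with A1 and A2), while B1 points only at A1, which is why A2 B1 C1 is never produced.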

Stream Processing (Path)
(a) XPath Query: A//B//C
(b) XML Data [Figure: tree with nodes A1, B1 to B4, C1 to C4, D1, D2, E1]
(c) Query Stack Structure [Figure: stacks for A, B, C linked by start and end pointers]
(d) Query Result: A1B1C1, A1B1C2, A1B4C4

TWIG Algorithm
Query structures
  Stacks: represent query elements
  Pointers: links between stacks (start pointer, end pointer, next pointer)
  All results are stored in the stacks
Advantages
  Can process twig queries
  No join problem

Stream Processing (Twig)
(a) Twig Query, XPath: A[//B[//C]/D]//B
(b) XML Data [Figure: tree with nodes A1, B1 to B4, C1 to C4, D1, D2, E1]
(c) Stack Structure of Twig Query [Figure: stacks for A, B, C, D, B linked by start, end, and next pointers]
(d) Twig Query Result: A1B1C1D2B3, A1B1C1D2B4

XML Cache Maintenance
Benefits of caching data
  Improving query performance
  Reducing the load on databases

Update Maintenance
[Figure: timeline of messages between the proxy and the source database. Initially: (1) the source query, (2) the query result, stored as Cache0. On each update (Data1 to Datan): (1) update path information, (2) the source query (optional), (3) query result (optional), producing Cache1 to Cachen]

Contributions
Incremental maintenance of XML cache
  Number of source queries: 0
  Processing both path and twig queries
  Improving query performance
  Reducing cache size

XML Source Updates
Two primitive operations
  Insertion: insert a leaf node into an XML document
  Deletion: delete a leaf node from an XML document
Limitations
  One operation at a time
  The XML document should be indexed

Incremental Cache Maintenance
Two phases
  Differences discovery
  Data integration
Query types supported
  Path query
  Twig query

Path Query - Insertion
(a) XML Data [Figure: tree whose nodes are labeled Name: Index, from A: 1 to C: 20]
(b) XPath Query: A//B//B//C
(c) Update Path: A/B/B/B/E/C; Update Index Path: 1/2/8/9/10/20
(d) Cache0 [Figure: stacks for A, B, C with start and end points]
(e) Temp Structure [Figure: stacks for A, B, C built from the update path]
(f) Cache1 [Figure: stacks for A, B, C after the insertion]

Path Query - Deletion
(a) XML Data [Figure: tree whose nodes are labeled Name: Index, from A: 1 to C: 20]
(b) XPath Query: A//B//B//C
(c) Update Path: A/B/B/B/E/C; Update Index Path: 1/2/8/9/10/20
(d) Cache1 [Figure: stacks for A, B, C with start and end points]
(e) Temp Structure [Figure: stacks for A, B, C built from the update path]
(f) Cache2 [Figure: stacks for A, B, C after the deletion]

Twig Query - Insertion
(a) XML Data [Figure: tree with nodes A: 1, B: 2, C: 3, B: 4, C: 5, D: 6, D: 7, C: 8, E: 9, D: 10, E: 11, D: 12]
(b) XPath Query: A[//B[//C]//D]//E
(c) Update Path: A/B/D/D; Update Index Path: 1/2/7/12
(d) Source Query Structure, Cache0, Cache1 [Figure: stacks for A, B, C, D, E with start, end, and next points]
(e) Temp Structure [Figure: stacks built from the update path]

Experiments
Language
  Java: Borland JBuilder 9 Enterprise
Data sets
  Real data: 4.41 MB
  Synthetic data: 860 KB
Query time
  Incremental maintenance: XCM algorithm
  Full re-computation

Experiments: Average Time
Data set  | Query                                      | XCM algorithm (s) | Full re-computation (s)
Synthetic | A/B/D                                      | 0.00412           | 1.79024
Synthetic | A//B//C/D                                  | 0.00802           | 2.25693
Synthetic | A//B//B//C//E                              | 0.04575           | 7.6941
Real      | issues//articles//authors                  | 0.01643           | 8.40633
Real      | issuesTuple//articlesTuples/title          | 0.03252           | 15.9213
Real      | issues//articles//author[@AuthorPosition]  | 0.06046           | 17.06484

Reachability
Many problems are modeled as DAGs
Common requirement: efficient reachability testing

Related Work
  Simple way
  Transitive closure matrix
  Bit-vector encoding schemes
  Numerical interval-based approach
  Other approaches

The Concept
We propose a new encoding scheme
  find a spanning tree for the graph
  label the spanning tree with numerical intervals
  generate a transitive closure matrix for the tails and heads of the non-tree edges
  test reachability in constant time
  support updates
Moreover, our approach is effective: O(|Vnt|^2) space cost, not O(|V|^2)
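The steps listed above can be sketched in simplified form. The code below labels a depth-first spanning tree with (begin, end) intervals, collects the non-tree edges, and precomputes for each tail node the set of head nodes it can reach through non-tree edges. Note the deliberate simplification: the query scans the tail nodes instead of using the constant-time range lookup described on the slides, and it assumes every node is reachable from the chosen root. The graph, the function names, and the index layout are all hypothetical.

```python
def build_index(graph, root):
    """graph: dict node -> list of successors (a DAG).
    Returns (begin, end, closure) where closure maps tail -> reachable heads."""
    begin, end, non_tree = {}, {}, []
    counter = 0

    def dfs(u):
        nonlocal counter
        counter += 1
        begin[u] = counter
        for v in graph.get(u, []):
            if v in begin:
                non_tree.append((u, v))      # v already labeled: non-tree edge
            else:
                dfs(v)                       # tree edge
        counter += 1
        end[u] = counter

    dfs(root)

    def inside(a, d):                        # d lies in the spanning subtree of a
        return begin[a] <= begin[d] and end[d] <= end[a]

    closure = {}
    for tail, head in non_tree:
        closure.setdefault(tail, set()).add(head)
    changed = True
    while changed:                           # propagate until nothing new is added
        changed = False
        for tail in closure:
            for head in list(closure[tail]):
                for other in closure:
                    if inside(head, other) and not closure[other] <= closure[tail]:
                        closure[tail] |= closure[other]
                        changed = True
    return begin, end, closure

def reachable(index, u, v):
    begin, end, closure = index
    def inside(a, d):
        return begin[a] <= begin[d] and end[d] <= end[a]
    if inside(u, v):                         # reachable through tree edges alone
        return True
    # Otherwise the path must leave u's subtree through some non-tree edge.
    return any(inside(u, tail) and any(inside(head, v) for head in heads)
               for tail, heads in closure.items())

# Hypothetical DAG; edges g->d, e->c, and x->f become non-tree edges.
graph = {"a": ["b", "e", "x"], "b": ["c"], "c": ["d"],
         "e": ["f", "c"], "f": ["g"], "g": ["d"], "x": ["f"]}
index = build_index(graph, "a")
print(reachable(index, "x", "d"), reachable(index, "e", "b"))
```

Only the tails and heads of non-tree edges appear in `closure`, which reflects the space argument on the slide: the extra table grows with the number of such nodes rather than with the whole vertex set.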

Example
[Figure: a DAG drawn with tree edges and non-tree edges; nodes are labeled with intervals: a (1,18), b (2,7), c (3,6), d (4,5), e (8,17), f (9,12), g (10,11), h (13,14), i (15,16)]
[Figure: Transitive Closure Matrix Mnt, indexed by the tail nodes a, f, h, i and the head nodes b, c, g]

[Figure: the example graph with tails of non-tree edges marked T and heads marked H. Tail node ranges: a (1,18), f (9,12), h (13,14), i (15,16). Head node ranges: b (2,7), c (3,6), g (10,11). A number line from 1 to 18 marks tail begin (+), tail end (-), head begin, and head end positions]

[Figure: the example graph (tree edges and non-tree edges) with tail node ranges a (1,18), f (9,12), h (13,14), i (15,16). A number line over positions 1, 9, 12, 13, 14, 15, 16, 18, marked with begin (+) and end (-), shows the tail ranges of head nodes b, c, and g in Mnt, next to the Transitive Closure Matrix Mnt with tail nodes a, f, h, i and head nodes b, c, g]

Analysis
Label the spanning tree with numerical intervals
  deciding the root takes O(|E| + |V|) time
  generating the spanning tree takes O(|V|) time and requires O(|V|) space
Find the representing tail and head ranges
  O(|Vnt| log |Vnt|) time and O(|Vnt|) space
Generate the transitive closure matrix Mnt
  O(|Vnt|^3) time and O(|Vnt|^2) space
Update

Conclusions
A new encoding scheme to record the transitive closure information using nested numerical intervals
  O(|Vnt|^3 + |V|) time and O(|Vnt|^2 + |V|) space complexity
  Testing reachability in O(1) time
  Updating locally by adjusting the encoding information

Future Work
  Version management
  Materialized views
  Cache management
  Aggregate query processing
  Streaming data processing
  ...