Structural Joins: A Primitive for Efficient XML Query Pattern Matching Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava,


Structural Joins: A Primitive for Efficient XML Query Pattern Matching Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu Modified from talk created by Sandhya Rani Are Prabhas Kumar Samanta

Introduction
- XML: Extensible Markup Language
- Documents have tags giving extra information about sections of the document, e.g. XML Introduction …
- Extensible, unlike HTML: users can add new tags, and separately specify how each tag should be handled for display

A-101 Downtown 500 A-101 Johnson Introduction

Comparison with Relational Data
- Inefficient: tags, which in effect represent schema information, are repeated
- Better than relational tuples as a data-exchange format
  - Unlike relational tuples, XML data is self-documenting due to the presence of tags
  - Non-rigid format: tags can be added
  - Allows nested structures
  - Wide acceptance, not only in database systems, but also in browsers, tools, and applications

Structure of XML Data
- Tag: label for a section of data
- Element: section of data beginning with <tagname> and ending with the matching </tagname>
- Elements must be properly nested
  - Proper nesting: … ….
  - Improper nesting: … ….
- A mixture of text with sub-elements is legal in XML, e.g.: This account is seldom used any more. A-102

More features of XML Schema
- Attributes are specified by the xs:attribute tag: adding the attribute use = “required” means a value must be specified
- Key constraint: “account numbers form a key for account elements under the root bank element”
- Foreign key constraint from depositor to account

Querying and Transforming XML Data
- Translation of information from one XML schema to another
- Querying on XML data
- Standard XML querying/translation languages
  - XPath: simple language consisting of path expressions
  - XSLT: simple language designed for translation from XML to XML and from XML to HTML
  - XQuery: an XML query language with a rich set of features

Tree Model of XML Data
- Query and transformation languages are based on a tree model of XML data
- An XML document is modeled as a tree, with nodes corresponding to elements and attributes

XPath
- XPath is used to address (select) parts of documents using path expressions
- A path expression is a sequence of steps separated by “/”
- Result of a path expression: the set of values that, along with their containing elements/attributes, match the specified path
  - E.g. /bank-2/customer/customer_name evaluated on the bank-2 data returns the customer_name elements for Joe and Mary
  - E.g. /bank-2/customer/customer_name/text( ) returns the same names, but without the enclosing tags

XPath (Cont.)
- The initial “/” denotes the root of the document (above the top-level tag)
- Path expressions are evaluated left to right; each step operates on the set of instances produced by the previous step
- Selection predicates may follow any step in a path, in [ ]
  - E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400
  - /bank-2/account[balance] returns account elements containing a balance subelement
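The path expressions above can be tried out with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. This is a minimal sketch: the bank-2 document below, including the `account_number` attribute, is invented for illustration, and the numeric predicate `[balance > 400]` is emulated in Python because ElementTree does not support comparison predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical bank-2 data, made up for this illustration.
BANK = """
<bank-2>
  <account account_number="A-101"><balance>500</balance></account>
  <account account_number="A-102"><balance>350</balance></account>
  <account account_number="A-103"/>
  <customer><customer_name>Joe</customer_name></customer>
  <customer><customer_name>Mary</customer_name></customer>
</bank-2>
"""
root = ET.fromstring(BANK.strip())  # root is the <bank-2> element

# /bank-2/customer/customer_name/text()
names = [e.text for e in root.findall("customer/customer_name")]

# /bank-2/account[balance]: accounts that contain a balance subelement
with_balance = root.findall("account[balance]")

# /bank-2/account[balance > 400]: the comparison is done in Python
rich = [a for a in with_balance if float(a.findtext("balance")) > 400]

# /bank-2//customer_name: any customer_name below the root
anywhere = [e.text for e in root.findall(".//customer_name")]

print(names, len(with_balance), [a.get("account_number") for a in rich])
```

Paths are written relative to `root` because ElementTree evaluates them from the element they are called on rather than from the document node.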

XPath (Cont.)
- Attributes are accessed using “@”
- E.g. /bank-2/account[balance > 400]/@… returns the account numbers of accounts with balance > 400

More XPath Features
- “//” can be used to skip multiple levels of nodes
  - E.g. /bank-2//customer_name finds any customer_name element anywhere under the /bank-2 element, regardless of the element in which it is contained
- A step in the path can go to parents, siblings, ancestors and descendants of the nodes generated by the previous step, not just to the children
  - “//”, described above, is a short form for specifying “all descendants”
  - “..” specifies the parent
- doc(name) returns the root of a named document

FLWOR Syntax in XQuery
- Find all accounts with balance > 400, with each result enclosed in an .. tag
- for $x in /bank-2/account let $acctno := where $x/balance > 400 return { $acctno }
- Items in the return clause are XML text unless enclosed in {}, in which case they are evaluated
- XPath expressions are used as sub-expressions
- Allows joins and complex aggregation (with group by expressed using subqueries), which XPath does not support

Efficient evaluation of XPath parent-child (PC) and ancestor-descendant (AD) steps

Motivation Query : book[title='XML'] //author[. ='jane']

Query Tree book[title='XML'] //author[.='jane']

Decomposition Of Query Tree

Introduction
- XQuery queries specify patterns of selection predicates that have tree-structured relationships
  - e.g. book[title = ‘XML’] // author[. = ‘jane’]
- The primitive tree-structured relationships are
  - Parent-child: (book, title), (title, XML), (author, jane)
  - Ancestor-descendant: (book, author)
- Finding all occurrences of these relationships is a core operation for XML query processing

Different ways of matching structural relationships
- Tuple-at-a-time approach
  - Tree traversal using child and parent pointers
  - Inefficient because it requires a complete pass through the data
- Pointer-based approach
  - Maintain (parent, child) pairs and identify (ancestor, descendant) pairs from them: high time complexity
  - Maintain (ancestor, descendant) pairs: high space complexity
  - Either case is infeasible

Solution: Set-at-a-time approach
- Uses a positional representation of occurrences of XML elements and string values
  - Element: 3-tuple (DocId, StartPos:EndPos, LevelNum)
  - String: 3-tuple (DocId, StartPos, LevelNum)
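One way to picture the positional representation is to number a small document during a depth-first walk. The sketch below is an assumption about how positions could be assigned (one counter tick per start tag, string value and end tag); the function name `number_nodes` and the sample document are illustrative, not taken from the slides.

```python
import xml.etree.ElementTree as ET

def number_nodes(xml_text, doc_id=1):
    """Return (elements, strings) using the positional representation.

    elements: (tag, DocId, StartPos, EndPos, LevelNum)
    strings:  (value, DocId, StartPos, LevelNum)
    """
    elements, strings = [], []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1                      # position of the start tag
        start = counter
        text = (elem.text or "").strip()
        if text:
            counter += 1                  # position of the string value
            strings.append((text, doc_id, counter, level + 1))
        for child in elem:
            visit(child, level + 1)
            tail = (child.tail or "").strip()
            if tail:                      # text that follows a sub-element
                counter += 1
                strings.append((tail, doc_id, counter, level + 1))
        counter += 1                      # position of the end tag
        elements.append((elem.tag, doc_id, start, counter, level))

    visit(ET.fromstring(xml_text), 1)
    elements.sort(key=lambda e: e[2])     # sort by StartPos
    return elements, strings

elements, strings = number_nodes(
    "<book><title>XML</title><author>jane</author></book>")
for row in elements + strings:
    print(row)
```

With this numbering an element's interval StartPos:EndPos encloses the intervals of everything nested inside it, which is the property that set-at-a-time joins over sorted lists rely on.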

Positional Representation

Structural Relationship Test
- Element E1 = (D1, S1:E1, L1), element E2 = (D2, S2:E2, L2)
- If D1 = D2, S1 < S2 and E2 < E1, then (E1, E2) is an ancestor-descendant pair
- If D1 = D2, S1 < S2, E2 < E1 and L1 + 1 = L2, then (E1, E2) is a parent-child pair
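The two tests translate directly into predicates over the tuples. In this sketch the `Node` type and the concrete positions are illustrative assumptions; a string value is modelled with EndPos equal to StartPos.

```python
from typing import NamedTuple

class Node(NamedTuple):
    doc: int    # DocId
    start: int  # StartPos
    end: int    # EndPos (equal to start for a string value)
    level: int  # LevelNum

def is_ancestor(a: Node, d: Node) -> bool:
    """True if a is an ancestor of d: same document, a's interval encloses d's."""
    return a.doc == d.doc and a.start < d.start and d.end < a.end

def is_parent(a: Node, d: Node) -> bool:
    """True if a is the parent of d: an ancestor exactly one level above."""
    return is_ancestor(a, d) and a.level + 1 == d.level

# Assumed positions for <book><title>XML</title><author>jane</author></book>
book = Node(1, 1, 8, 1)
title = Node(1, 2, 4, 2)
xml_value = Node(1, 3, 3, 3)

print(is_ancestor(book, xml_value), is_parent(book, xml_value))
print(is_parent(title, xml_value))
```

Both checks are constant-time comparisons, so no tree traversal or pointer chasing is needed to decide a relationship.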

Structural Joins
- Join algorithms for matching structural relationships: tree-merge and stack-tree
  - Input: lists of tree nodes sorted by (DocId, StartPos)
  - Output: sorted list of results joined according to the desired structural relationship
- Use in XML query pattern matching
  - Decompose the query tree pattern into binary structural relationships
  - Match each relationship against the XML database
  - ‘Stitch’ together the basic matches

Algorithm Tree-Merge-Anc
- Output: ordered by ancestors
- Algorithm: loop through the list of ancestors in increasing order of StartPos
  - For each ancestor, skip over unmatchable descendants
  - Check for the ancestor-descendant relationship (or parent-child relationship)
  - Append each result to the output list
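The loop described above can be sketched as follows. This is an illustrative reading of the slide rather than the paper's exact pseudocode: it assumes a single document (DocId is ignored), both lists sorted by StartPos, and made-up positions for the Book/Title example.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    start: int
    end: int
    level: int

def tree_merge_anc(alist, dlist, parent_child=False):
    """Structural join with output ordered by ancestor (single document assumed)."""
    out = []
    begin = 0  # first descendant that can still match the current or a later ancestor
    for a in alist:
        # Skip descendants that start before this ancestor: they cannot match
        # it or any later ancestor, since ancestors arrive in StartPos order.
        while begin < len(dlist) and dlist[begin].start < a.start:
            begin += 1
        # Scan the descendants that start inside a's interval.
        i = begin
        while i < len(dlist) and dlist[i].start < a.end:
            d = dlist[i]
            if a.start < d.start and d.end < a.end:
                if not parent_child or d.level == a.level + 1:
                    out.append((a.name, d.name))
            i += 1
    return out

# Assumed positions for the Book/Title/XML/Author/Jane tree.
book = Node("Book_1", 1, 8, 1)
title = Node("Title_1", 2, 4, 2)
xml_value = Node("XML_1", 3, 3, 3)
jane = Node("Jane_1", 6, 6, 3)

print(tree_merge_anc([title], [book, xml_value, jane]))
```

The inner scan restarts from `begin` for every ancestor, which is why nested ancestors can cause the same descendants to be revisited.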

Example
- AList = {Title_1}, DList = {Book_1, XML_1, Jane_1}
- For Title_1:
  - Skips Book_1, as it starts before Title_1
  - Pairs with XML_1
  - Does not consider Jane_1, as it starts after Title_1 ends
- (Figure: document tree with Book, Title, XML, Author and Jane nodes.)

Worst case for Tree-Merge-Anc

Tree-Merge Join Detail Algorithm (O/p Sorted Ancestor/Parent order) ‏

Time and Space Complexity
- The space and time complexity of Tree-Merge-Anc is O(|AList| + |DList| + |OutputList|) for ancestor-descendant structural relationships
  - Optimal, but the result is sorted on ancestors
  - The cost of re-sorting on descendants can be significant
- For parent-child relationships, however, the complexity of Tree-Merge-Anc is O(|AList| + |DList|^2), even if OutputList is linear in |AList|

Tree-Merge-Desc Algorithm
- Output: ordered by descendants
- Algorithm: loop over the list of descendants in increasing order of StartPos
  - For each descendant, skip over unmatchable ancestors
  - Check for the ancestor-descendant relationship (or parent-child relationship)
  - Append each result to the output list
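A matching sketch for the descendant-ordered variant, under the same assumptions as before (single document, lists sorted by StartPos, illustrative positions). It is a reading of the slide, not the paper's exact pseudocode.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    start: int
    end: int
    level: int

def tree_merge_desc(alist, dlist, parent_child=False):
    """Structural join with output ordered by descendant (single document assumed)."""
    out = []
    begin = 0  # first ancestor that can still match the current or a later descendant
    for d in dlist:
        # Skip leading ancestors that end before this descendant starts.
        while begin < len(alist) and alist[begin].end < d.start:
            begin += 1
        # Scan the ancestors that start before d; only some of them enclose d.
        i = begin
        while i < len(alist) and alist[i].start < d.start:
            a = alist[i]
            if d.end < a.end and (not parent_child or d.level == a.level + 1):
                out.append((a.name, d.name))
            i += 1
    return out

# Assumed positions for the Book/Title/XML/Author/Jane tree.
book = Node("Book_1", 1, 8, 1)
title = Node("Title_1", 2, 4, 2)
xml_value = Node("XML_1", 3, 3, 3)
jane = Node("Jane_1", 6, 6, 3)

print(tree_merge_desc([book, title], [book, xml_value, jane]))
```

Only a prefix of finished ancestors can be skipped permanently; an ancestor such as Title_1 that sits behind a still-open Book_1 is rescanned for every later descendant, which is the source of the poor worst case.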

Example
- AList = {Book_1, Title_1}, DList = {Book_1, XML_1, Jane_1}
- Book_1: does not have any matching ancestor
- XML_1: pairs with Book_1 and Title_1
- Jane_1: pairs with Book_1; Title_1 is not a match, as it ends before Jane_1 starts
- (Figure: document tree with Book, Title, XML, Author and Jane nodes.)

Worst case for Tree-Merge-Desc

Tree-Merge Join Algorithm (O/p Sorted Descendent/Child order)‏

Time and Space Complexity
- The time complexity of Tree-Merge-Desc is O(|AList| + |DList| + |OutputList|^2) for ancestor-descendant structural relationships
- But not so bad in practice

Stack-Tree Algorithm
- Basic idea: depth-first traversal of the XML tree
  - takes linear time, with stack size = depth of the tree
  - all ancestor-descendant relationships appear on the stack during the traversal
- Main problem: we do not want to traverse the whole database, just the nodes in AList/DList

Stack-Tree-Desc (output sorted by descendants)
- The stack contains elements that can be ancestors of the remaining DList elements
- Consider elements from AList and DList one by one, in StartPos order
  - If the top of the stack cannot be an ancestor of the new element, pop it
  - If the new element is an 'a' (a potential ancestor), push it onto the stack
  - Otherwise the new element is a 'd': pair it with every element on the stack (bottom to top)
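The steps above can be sketched as a single merge pass with a stack. The code is an illustrative reading of the slide for a single document; the positions given to a1..a3 and d1..d6 are assumptions chosen to reproduce the nesting of the running example.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    start: int
    end: int

def stack_tree_desc(alist, dlist):
    """Structural join with output ordered by descendant, one pass over both lists."""
    out, stack = [], []
    i = j = 0
    while j < len(dlist):                  # no output is possible once DList is exhausted
        a = alist[i] if i < len(alist) else None
        d = dlist[j]
        nxt = d.start if a is None else min(a.start, d.start)
        # Pop stack entries that end before the next node starts:
        # they cannot be ancestors of anything still to come.
        while stack and stack[-1].end < nxt:
            stack.pop()
        if a is not None and a.start < d.start:
            stack.append(a)                # a is nested inside everything left on the stack
            i += 1
        else:
            # Every stack entry encloses d: emit pairs bottom to top.
            out.extend((anc.name, d.name) for anc in stack)
            j += 1
    return out

# Assumed positions: a1 encloses d1, a2, d6; a2 encloses d2, a3, d5; a3 encloses d3, d4.
alist = [Node("a1", 1, 20), Node("a2", 4, 17), Node("a3", 7, 12)]
dlist = [Node("d1", 2, 3), Node("d2", 5, 6), Node("d3", 8, 9),
         Node("d4", 10, 11), Node("d5", 13, 14), Node("d6", 18, 19)]

for pair in stack_tree_desc(alist, dlist):
    print(pair)
```

Each node is pushed and popped at most once, and each output pair costs constant work, so the pass is linear in the input plus output sizes. A parent-child variant would additionally compare levels before emitting a pair.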

Example (figure: XML tree over AList = a1, a2, a3 and DList = d1 … d6, with the stack contents at each step)
- Output, in order: (a1,d1); (a1,d2), (a2,d2); (a1,d3), (a2,d3), (a3,d3); (a1,d4), (a2,d4), (a3,d4)
- Pop a3, then output (a1,d5), (a2,d5)
- Pop a2, then output (a1,d6)

Stack-Tree Desc. (O/p sorted by Descendants)‏

Time and Space Complexity
- The time complexity of Stack-Tree-Desc is O(|AList| + |DList| + |OutputList|) for ancestor-descendant as well as parent-child structural relationships
- The I/O complexity of Stack-Tree-Desc is O(|AList|/B + |DList|/B + |OutputList|/B), where B is the blocking factor, for AD and PC relationships

Stack-Tree-Anc
- Output: ordered by ancestors
- Basic problem: results from a particular descendant cannot be output immediately, because later descendants may match earlier ancestors
- Solution: keep lists of matching descendant nodes with each stack node
  - Self-list: descendants that match this node; each descendant node is added to the self-lists of all matching ancestor nodes
  - Inherit-list: results inherited from descendants already popped from the stack, to be output after the self-list matches are output
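The self-list and inherit-list bookkeeping can be sketched as below. This is an illustrative version for a single document with assumed positions; it uses plain Python lists, so the list concatenation on pop copies results, whereas a linear-time implementation would need constant-time list appends (for example linked lists).

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    start: int
    end: int

def stack_tree_anc(alist, dlist):
    """Structural join with output ordered by ancestor."""
    out = []
    stack = []  # entries: [node, self_list, inherit_list]

    def pop():
        node, self_list, inherit_list = stack.pop()
        results = self_list + inherit_list    # own matches first, then inherited ones
        if stack:
            stack[-1][2].extend(results)      # hand over to the new top's inherit-list
        else:
            out.extend(results)               # bottom of the stack: safe to output

    i = j = 0
    while j < len(dlist):
        a = alist[i] if i < len(alist) else None
        d = dlist[j]
        nxt = d.start if a is None else min(a.start, d.start)
        while stack and stack[-1][0].end < nxt:
            pop()
        if a is not None and a.start < d.start:
            stack.append([a, [], []])
            i += 1
        else:
            for entry in stack:               # d matches every node on the stack
                entry[1].append((entry[0].name, d.name))
            j += 1
    while stack:                              # flush whatever is still on the stack
        pop()
    return out

# Assumed positions: a1 encloses d1, a2, d6; a2 encloses d2, a3, d5; a3 encloses d3, d4.
alist = [Node("a1", 1, 20), Node("a2", 4, 17), Node("a3", 7, 12)]
dlist = [Node("d1", 2, 3), Node("d2", 5, 6), Node("d3", 8, 9),
         Node("d4", 10, 11), Node("d5", 13, 14), Node("d6", 18, 19)]

print(stack_tree_anc(alist, dlist))
```

Results are only written out when the bottom stack entry is popped, because until then a later descendant could still match an earlier ancestor.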

Example (figure: AList = a1, a2, a3 and DList = d1 … d6, with the self-list and inherit-list of each stack node)
- d1 to d4 are added to self-lists: (a1,d1); (a1,d2), (a2,d2); (a1,d3), (a2,d3), (a3,d3); (a1,d4), (a2,d4), (a3,d4)
- Pop a3: (a3,d3), (a3,d4) go to the inherit-list of a2
- d5 is added to self-lists: (a1,d5), (a2,d5)
- Pop a2: (a2,d2), (a2,d3), (a2,d4), (a2,d5), (a3,d3), (a3,d4) go to the inherit-list of a1
- d6 is added to the self-list of a1: (a1,d6)
- OUTPUT: (a1,d1), (a1,d2), (a1,d3), (a1,d4), (a1,d5), (a1,d6), (a2,d2), (a2,d3), (a2,d4), (a2,d5), (a3,d3), (a3,d4)

Algorithm

Time and Space Complexity of Stack-Tree-Anc
- The time complexity of Stack-Tree-Anc is O(|AList| + |DList| + |OutputList|) for ancestor-descendant as well as parent-child structural relationships
- The I/O complexity of Stack-Tree-Anc is O(|AList|/B + |DList|/B + |OutputList|/B), where B is the blocking factor, for AD and PC relationships
- Requires proper handling of list operations

Experimental Evaluation
- Implemented the join algorithms in the TIMBER XML query engine
- TIMBER is a native XML query engine built on top of SHORE
- The data set consists of 6.3 million element nodes (800 MB of XML documents in text format)
- Results are averages over multiple runs (warm cache)

Query used QS1 to QS6 are simple structural relationship queries QC1 and QC2 are complex chain queries evaluated using pipeline

performance

contd..

Holistic Twig Joins: Optimal XML Pattern Matching
Authors: Nicolas Bruno, Nick Koudas, Divesh Srivastava
Source: ACM SIGMOD 2002, June 4-6, Madison, Wisconsin, USA

Introduction
- XML query: matching XML data with a tree-structured pattern
- Previous attempts decompose the query into small pieces and solve them separately
  - This is a complex optimization problem: e.g. ((book title) XML) (year 2000), or (((book year) 2000) title) XML, and many other possibilities
  - Intermediate results can be large
- This paper proposes a novel holistic twig join approach for matching XML query twig patterns, in which no large intermediate results are created

Twig Pattern
- Query twig patterns (figure: an author node with children fn = jane and ln = doe)
- Given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D

Holistic Join
- Algorithms: PathStack and TwigStack
- Uses a chain of linked stacks to compactly represent partial results for individual root-to-leaf query paths
- Each node q in the query has associated:
  - A stream T_q, with the positions of the elements corresponding to node q, in increasing “left” order
  - A stack S_q, with a compact encoding of partial solutions (chained)
- (Figure: XML fragment, query, matches and stacks.)

PathStack: Holistic Path Queries
- Repeatedly constructs stack encodings of partial solutions by iterating through the streams T_q
- Stacks encode the set of partial solutions from the current element in T_q to the root of the XML tree
- WHILE (!eof)
  - qN = getMin(q)
  - clean stacks
  - push T_qN’s first element to S_qN
  - IF qN is a leaf node, expand solutions
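A simplified sketch of the loop above, for path queries whose edges are all ancestor-descendant (“//”). It is not the paper's pseudocode: elements are plain (start, end) pairs in a single document, a stack entry stores the index of the top of the parent stack at push time instead of a pointer, and elements whose parent stack is empty are skipped rather than pushed. All names are illustrative.

```python
def path_stack(query, streams):
    """query: list of tags from root to leaf, joined by '//' edges.
    streams: dict mapping tag -> list of (start, end) pairs sorted by start.
    Returns all matches as tuples of regions, one region per query node."""
    n = len(query)
    leaf = n - 1
    cursor = [0] * n                  # read position in each query node's stream
    stacks = [[] for _ in range(n)]   # entries: (region, index of parent-stack top)
    results = []

    def expand(i, top, suffix):
        # Combine suffix with every entry 0..top of stack i, following indexes upward.
        for k in range(top + 1):
            region, ptr = stacks[i][k]
            if i == 0:
                results.append((region,) + suffix)
            else:
                expand(i - 1, ptr, (region,) + suffix)

    while cursor[leaf] < len(streams[query[leaf]]):
        # getMin: the query node whose next element starts first
        # (on a tie, the deeper query node goes first).
        live = [i for i in range(n) if cursor[i] < len(streams[query[i]])]
        qmin = min(live, key=lambda i: (streams[query[i]][cursor[i]][0], -i))
        region = streams[query[qmin]][cursor[qmin]]
        cursor[qmin] += 1
        # Clean stacks: drop entries that end before the new element starts.
        for s in stacks:
            while s and s[-1][0][1] < region[0]:
                s.pop()
        if qmin == 0:
            stacks[0].append((region, -1))
        elif stacks[qmin - 1]:
            ptr = len(stacks[qmin - 1]) - 1
            if qmin == leaf:
                expand(leaf - 1, ptr, (region,))   # leaf: emit solutions right away
            else:
                stacks[qmin].append((region, ptr))
    return results

# Assumed regions: a(1,10) contains b(2,9), which contains b(3,8), which contains c(4,5).
streams = {"a": [(1, 10)], "b": [(2, 9), (3, 8)], "c": [(4, 5)]}
print(path_stack(["a", "b", "c"], streams))
```

The stacks hold only nested elements, so the number of stored entries is bounded by the document depth per query node, while the full set of matches is produced only at expansion time.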

PathStack Example

Twig Queries
- Naive adaptation of PathStack
  - Solve each root-to-leaf path independently
  - Merge-join the intermediate results
- Problem: many intermediate results might not be part of the final answer

TwigStack
- Computes only partial solutions that are guaranteed to extend to a final solution
- Before pushing an element for query node N onto stack S_N, it ensures that the element has a descendant in each of the streams T_n for n ∈ children(N)
- Merges partial solutions to obtain all matches

Example (figure: twig pattern with an author node and children fn = jane and ln = doe)

Questions ? Thank You

ORDPATHs: Insert-Friendly XML Node Labels
Authors: Patrick O’Neil, Elizabeth O’Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, Nigel Westbury
Source: ACM SIGMOD 2004, June 13–18, 2004, Paris, France

Labelling schemes for XML trees
- Global order: each node is assigned a number that represents the node’s absolute position in the document
- Dewey order: each node is assigned a vector that represents the path from the document’s root to the node
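A small sketch of the two labelling schemes on a toy tree. The tree shape, the tuple encoding of Dewey vectors and the function names are assumptions made for illustration.

```python
def dewey_labels(tree, label=(1,)):
    """Yield (label, tag) in document order; tree is (tag, [children])."""
    tag, children = tree
    yield label, tag
    for i, child in enumerate(children, start=1):
        # A child's vector is its parent's vector plus its sibling position.
        yield from dewey_labels(child, label + (i,))

def is_ancestor(a, d):
    """With Dewey vectors, a is an ancestor of d iff a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

doc = ("book", [("title", []), ("author", [("fn", []), ("ln", [])])])

# Dewey order: one vector per node.
dewey = {tag: label for label, tag in dewey_labels(doc)}
# Global order: the node's absolute position in document order.
global_order = {tag: n for n, (_, tag) in enumerate(dewey_labels(doc), start=1)}

print(dewey)
print(global_order)
print(is_ancestor(dewey["author"], dewey["ln"]))
```

Comparing two Dewey vectors as tuples gives document order. With either scheme, inserting a new node in front of existing ones changes the labels of the nodes that follow it, which is where relabelling costs come from.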

Problems
- These schemes work well for static XML data
- Poor performance for arbitrary insertions and deletions
- Relabelling of many nodes is necessary

ORDPATHs
- Hierarchical labelling scheme, like Dewey order
- Supports efficient structural modification of XML data
- Nodes can be inserted to the left or to the right of existing nodes

More Insertion Examples
- Arbitrary node insertion