Structural Joins: A Primitive for Efficient XML Query Pattern Matching
Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu
Modified from a talk created by Sandhya Rani Are, Prabhas Kumar Samanta

Introduction
XML: Extensible Markup Language
- Documents have tags giving extra information about sections of the document
  ➢ E.g. XML Introduction …
- Extensible, unlike HTML
  ➢ Users can add new tags, and separately specify how each tag should be handled for display

Introduction (cont.)
Example bank data (element tags not shown): A-101 Downtown 500; A-101 Johnson

Comparison with Relational Data
- Inefficient: tags, which in effect represent schema information, are repeated
- Better than relational tuples as a data-exchange format
- Unlike relational tuples, XML data is self-documenting due to the presence of tags
- Non-rigid format: tags can be added
- Allows nested structures
- Wide acceptance, not only in database systems, but also in browsers, tools, and applications

Structure of XML Data
- Tag: label for a section of data
- Element: section of data beginning with <tagname> and ending with matching </tagname>
- Elements must be properly nested
  ➢ Proper nesting: … ….
  ➢ Improper nesting: … ….
- Mixture of text with sub-elements is legal in XML, e.g.: This account is seldom used any more. A-102

More Features of XML Schema
- Attributes are specified by the xs:attribute tag: adding the attribute use = "required" means a value must be specified
- Key constraint: "account numbers form a key for account elements under the root bank element"
- Foreign key constraint from depositor to account

Querying and Transforming XML Data
- Translation of information from one XML schema to another
- Querying on XML data
- Standard XML querying/translation languages
  ➢ XPath: simple language consisting of path expressions
  ➢ XSLT: simple language designed for translation from XML to XML and XML to HTML
  ➢ XQuery: an XML query language with a rich set of features

Tree Model of XML Data
- Query and transformation languages are based on a tree model of XML data
- An XML document is modeled as a tree, with nodes corresponding to elements and attributes

XPath
o XPath is used to address (select) parts of documents using path expressions
o A path expression is a sequence of steps separated by "/"
o Result of a path expression: the set of values that, along with their containing elements/attributes, match the specified path
  ➢ E.g. /bank-2/customer/customer_name evaluated on the bank-2 data returns the customer_name elements for Joe and Mary
  ➢ E.g. /bank-2/customer/customer_name/text( ) returns the same names, but without the enclosing tags

XPath (Cont.)
- The initial "/" denotes the root of the document (above the top-level tag)
- Path expressions are evaluated left to right
  ➢ Each step operates on the set of instances produced by the previous step
- Selection predicates may follow any step in a path, in [ ]
  ➢ E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400
  ➢ /bank-2/account[balance] returns account elements containing a balance subelement

XPath (Cont.)
- Attributes are accessed using "@"
  ➢ E.g. /bank-2/account[balance > 400]/@account_number returns the account numbers of accounts with balance > 400

More XPath Features
- "//" can be used to skip multiple levels of nodes
  ➢ E.g. /bank-2//customer_name finds any customer_name element anywhere under the /bank-2 element, regardless of the element in which it is contained
- A step in the path can go to parents, siblings, ancestors and descendants of the nodes generated by the previous step, not just to the children
  ➢ "//", described above, is a short form for specifying "all descendants"
  ➢ ".." specifies the parent
- doc(name) returns the root of a named document
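The path expressions on the XPath slides can be tried with Python's standard xml.etree.ElementTree module, which implements only a subset of XPath. This is a minimal sketch, assuming a small made-up bank-2 document: the account numbers and balances are illustrative, and only the names Joe and Mary come from the slides. Paths are written relative to the root element, and the balance > 400 comparison is done in Python because ElementTree predicates do not support it.

```python
import xml.etree.ElementTree as ET

# Hypothetical bank-2 document; values are assumptions for illustration.
DOC = """
<bank-2>
  <account account_number="A-401"><balance>500</balance></account>
  <account account_number="A-402"><balance>300</balance></account>
  <customer><customer_name>Joe</customer_name></customer>
  <customer><customer_name>Mary</customer_name></customer>
</bank-2>
"""
root = ET.fromstring(DOC)  # root is the bank-2 element

# /bank-2/customer/customer_name/text()
names = [e.text for e in root.findall("customer/customer_name")]

# /bank-2//customer_name  ("//" skips any number of levels)
anywhere = [e.text for e in root.findall(".//customer_name")]

# /bank-2/account[balance]  (accounts that contain a balance subelement)
with_balance = root.findall("account[balance]")

# /bank-2/account[balance > 400]/@account_number
# The numeric comparison is applied in Python, not by ElementTree.
rich = [a.get("account_number")
        for a in root.findall("account")
        if float(a.findtext("balance")) > 400]

print(names, anywhere, len(with_balance), rich)
```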

FLWOR Syntax in XQuery
- Find all accounts with balance > 400, with each result enclosed in an <account_number> .. </account_number> tag:
    for $x in /bank-2/account
    let $acctno := $x/@account_number
    where $x/balance > 400
    return <account_number> { $acctno } </account_number>
- Items in the return clause are XML text unless enclosed in {}, in which case they are evaluated
- XPath expressions appear as sub-expressions
- Allows joins and complex aggregation (with group by using subqueries), which XPath does not support
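To make the roles of the for, let, where and return clauses concrete, the same query can be imitated in plain Python. This is only an analogy, not XQuery: the function name accounts_over and the sample document are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

def accounts_over(root, threshold):
    """Imitate the FLWOR query clause by clause (hypothetical helper)."""
    results = []
    for x in root.findall("account"):                    # for $x in /bank-2/account
        acctno = x.get("account_number")                 # let $acctno := $x/@account_number
        if float(x.findtext("balance")) > threshold:     # where $x/balance > 400
            out = ET.Element("account_number")           # return: wrap the value in a new tag
            out.text = acctno
            results.append(ET.tostring(out, encoding="unicode"))
    return results

# Made-up sample data for illustration.
sample = ET.fromstring(
    '<bank-2>'
    '<account account_number="A-401"><balance>500</balance></account>'
    '<account account_number="A-402"><balance>300</balance></account>'
    '</bank-2>'
)
flwor_result = accounts_over(sample, 400)
print(flwor_result)
```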

Efficient Evaluation of XPath PC/AD (Parent-Child / Ancestor-Descendant) Steps

Motivation Query : book[title='XML'] //author[. ='jane']

Query Tree book[title='XML'] //author[.='jane']

Decomposition Of Query Tree

Introduction
- XQuery queries specify patterns of selection predicates on elements that are related by tree-structured relationships
  ➢ E.g. book[title = 'XML']//author[. = 'jane']
- The primitive tree-structured relationships:
  ➢ Parent-child: (book, title), (title, XML), (author, jane)
  ➢ Ancestor-descendant: (book, author)
- Finding all occurrences of these relationships is a core operation for XML query processing

Different Ways of Matching Structural Relationships
- Tuple-at-a-time approach
  ➢ Tree traversal
  ➢ Using child & parent pointers
  ➢ Inefficient because it requires a complete pass through the data
- Pointer-based approach
  ➢ Maintain (parent, child) pairs and derive (ancestor, descendant) pairs from them: high time complexity
  ➢ Maintain (ancestor, descendant) pairs: high space complexity
  ➢ Either case is infeasible

Solution: Set-at-a-Time Approach
- Mechanism: positional representation of occurrences of XML elements and string values
  ➢ Element: 3-tuple (DocId, StartPos:EndPos, LevelNum)
  ➢ String: 3-tuple (DocId, StartPos, LevelNum)

Positional Representation
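One way to produce such a positional representation is a single depth-first pass that advances a counter at every start tag, string value and end tag. The sketch below is an assumption-laden illustration, not the numbering used in the original system: the function name number_nodes is hypothetical, each text value counts as one position, and text following a child element (tail text) is ignored.

```python
import xml.etree.ElementTree as ET

def number_nodes(root, doc_id=1):
    """Label elements as (tag, DocId, StartPos, EndPos, LevelNum) and
    strings as (text, DocId, StartPos, LevelNum). Hypothetical helper."""
    elements, strings = [], []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1                      # position of the start tag
        start = counter
        text = (elem.text or "").strip()
        if text:
            counter += 1                  # position of the string value
            strings.append((text, doc_id, counter, level + 1))
        for child in elem:
            visit(child, level + 1)
        counter += 1                      # position of the end tag
        elements.append((elem.tag, doc_id, start, counter, level))

    visit(root, 1)
    elements.sort(key=lambda e: e[2])     # sort by StartPos
    return elements, strings

book = ET.fromstring("<book><title>XML</title><author>jane</author></book>")
elements, strings = number_nodes(book)
for row in elements + strings:
    print(row)
```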

Structural Relationship Test
- Element e1 = (D1, S1:E1, L1), element e2 = (D2, S2:E2, L2)
- If D1 = D2, S1 < S2 and E2 < E1, then e1 is an ancestor of e2
- If D1 = D2, S1 < S2, E2 < E1 and L1 + 1 = L2, then e1 is the parent of e2
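The two tests translate directly into constant-time predicates. A minimal sketch, assuming a hypothetical Node tuple in which a string value is stored with EndPos equal to StartPos; the positions in the usage lines are made up.

```python
from typing import NamedTuple

class Node(NamedTuple):
    doc: int     # DocId
    start: int   # StartPos
    end: int     # EndPos (equal to start for a string value, by assumption)
    level: int   # LevelNum

def is_ancestor(a: Node, d: Node) -> bool:
    """a contains d: same document and d's interval lies strictly inside a's."""
    return a.doc == d.doc and a.start < d.start and d.end < a.end

def is_parent(a: Node, d: Node) -> bool:
    """Ancestor test plus the level condition L1 + 1 = L2."""
    return is_ancestor(a, d) and a.level + 1 == d.level

book = Node(1, 1, 8, 1)
title = Node(1, 2, 4, 2)
xml_string = Node(1, 3, 3, 3)
print(is_ancestor(book, xml_string), is_parent(book, xml_string), is_parent(title, xml_string))
```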

Structural Joins
- Join algorithms for matching structural relationships: tree-merge and stack-tree
  ➢ Input: lists of tree nodes sorted by (DocId, StartPos)
  ➢ Output: sorted lists of results, joined according to the desired structural relationship
- Use in XML query pattern matching
  ➢ Decompose the query tree pattern into binary structural relationships
  ➢ Match each relationship against the XML database
  ➢ 'Stitch' together the basic matches

Algorithm Tree-Merge-Anc
- Output: ordered by ancestors
- Algorithm: loop through the list of ancestors in increasing order of StartPos
  ➢ For each ancestor, skip over unmatchable descendants
  ➢ Check for the ancestor-descendant (or parent-child) relationship
  ➢ Append the result to the output list

Example
- AList = {Title_1}, DList = {Book_1, XML_1, Jane_1}
- Document tree: Book has children Title (containing XML) and Author (containing Jane)
- For Title_1:
  ➢ Skips Book_1, as it starts before Title_1
  ➢ Pairs with XML_1
  ➢ Does not consider Jane_1, as it ends after Title_1
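The Tree-Merge-Anc loop can be sketched as follows. This is an illustrative reading of the slides rather than the paper's exact pseudocode: the Node tuple, the function name and the positions are assumptions, and the inner scan is bounded by StartPos < a.EndPos, which for properly nested data stops at the same place.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    doc: int
    start: int
    end: int
    level: int

def tree_merge_anc(alist, dlist, parent_child=False):
    """Both lists sorted by (doc, start). Output is ordered by ancestor."""
    out = []
    begin = 0  # first descendant that can still match the current or a later ancestor
    for a in alist:
        # Skip descendants that start before this ancestor: they cannot match
        # it, nor any later ancestor, because later ancestors start even later.
        while begin < len(dlist) and (dlist[begin].doc, dlist[begin].start) < (a.doc, a.start):
            begin += 1
        i = begin
        # Scan descendants that start inside a's interval.
        while i < len(dlist) and dlist[i].doc == a.doc and dlist[i].start < a.end:
            d = dlist[i]
            if a.start < d.start and d.end < a.end and (not parent_child or a.level + 1 == d.level):
                out.append((a.name, d.name))
            i += 1
    return out

# Positions assumed for the Book/Title/Author example.
book = Node("Book_1", 1, 1, 8, 1)
title = Node("Title_1", 1, 2, 4, 2)
xml_s = Node("XML_1", 1, 3, 3, 3)
jane = Node("Jane_1", 1, 6, 6, 3)
print(tree_merge_anc([title], [book, xml_s, jane]))
```

Note that the inner scan restarts from begin for every ancestor, so nested ancestors rescan the same descendants; this is what the worst-case slide refers to.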

Worst case for Tree-Merge-Anc

Tree-Merge Join: Detailed Algorithm (output sorted in ancestor/parent order)

Tree-Merge-Desc Algorithm
- Output: ordered by descendants
- Algorithm: loop over the descendant list in increasing order of StartPos
  ➢ For each descendant, skip over unmatchable ancestors
  ➢ Check for the ancestor-descendant (or parent-child) relationship
  ➢ Append the result to the output list

Example
- AList = {Book_1, Title_1}, DList = {Book_1, XML_1, Jane_1}
- Document tree: Book has children Title (containing XML) and Author (containing Jane)
- Book_1: has no matching ancestor
- XML_1: pairs with Book_1 and Title_1
- Jane_1: pairs with Book_1; does not pair with Title_1, as Title_1 ends before Jane_1 starts
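A corresponding sketch of Tree-Merge-Desc, with the roles of the two lists swapped. As before, the Node tuple, the function name and the positions are assumptions made for illustration.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    doc: int
    start: int
    end: int
    level: int

def tree_merge_desc(alist, dlist, parent_child=False):
    """Both lists sorted by (doc, start). Output is ordered by descendant."""
    out = []
    begin = 0  # first ancestor that can still contain the current or a later descendant
    for d in dlist:
        # Skip ancestors that end before this descendant starts.
        while begin < len(alist) and (
            alist[begin].doc < d.doc
            or (alist[begin].doc == d.doc and alist[begin].end < d.start)
        ):
            begin += 1
        i = begin
        # Scan ancestors that start before this descendant.
        while i < len(alist) and alist[i].doc == d.doc and alist[i].start < d.start:
            a = alist[i]
            if d.end < a.end and (not parent_child or a.level + 1 == d.level):
                out.append((a.name, d.name))
            i += 1
    return out

# Positions assumed for the Book/Title/Author example.
book = Node("Book_1", 1, 1, 8, 1)
title = Node("Title_1", 1, 2, 4, 2)
xml_s = Node("XML_1", 1, 3, 3, 3)
jane = Node("Jane_1", 1, 6, 6, 3)
print(tree_merge_desc([book, title], [book, xml_s, jane]))
```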

Worst case for Tree-Merge-Desc

Tree-Merge Join Algorithm (output sorted in descendant/child order)

Stack-Tree Algorithm
- Basic idea: depth-first traversal of the XML tree
  ➢ Takes linear time with stack size = depth of tree
  ➢ All ancestor-descendant relationships appear on the stack during traversal
- Main problem: we do not want to traverse the whole database, just the nodes in the A-list/D-list

Stack-Tree-Desc (output sorted by descendants)
- The stack contains elements that can be ancestors of the remaining DList elements
- Consider elements from AList and DList one by one, in StartPos order
  ➢ If the top of the stack cannot be an ancestor of the current element, pop it
  ➢ If the current element is a new 'a', it is a potential ancestor: push it onto the stack
  ➢ Otherwise the current element is a new 'd': pair it with all elements on the stack (bottom to top)

Example
- AList: a1, a2, a3
- DList: d1, d2, d3, d4, d5, d6
- Output, in order:
  ➢ (a1,d1)
  ➢ (a1,d2) (a2,d2)
  ➢ (a1,d3) (a2,d3) (a3,d3)
  ➢ (a1,d4) (a2,d4) (a3,d4)
  ➢ Pop a3
  ➢ (a1,d5) (a2,d5)
  ➢ Pop a2
  ➢ (a1,d6)
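The a1..a3 / d1..d6 example can be replayed with a short sketch of Stack-Tree-Desc. The node positions are assumptions chosen so that a1 contains d1, a2 and d6, a2 contains d2, a3 and d5, and a3 contains d3 and d4; the Node tuple and function name are likewise hypothetical.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    doc: int
    start: int
    end: int

def stack_tree_desc(alist, dlist):
    """Both lists sorted by (doc, start). Output is ordered by descendant."""
    out, stack = [], []
    i = j = 0
    while j < len(dlist):
        d = dlist[j]
        # Decide which list supplies the next node in document order.
        take_a = i < len(alist) and (alist[i].doc, alist[i].start) < (d.doc, d.start)
        nxt = alist[i] if take_a else d
        # Pop stack entries that end before the next node starts.
        while stack and (stack[-1].doc, stack[-1].end) < (nxt.doc, nxt.start):
            stack.pop()
        if take_a:
            stack.append(nxt)      # potential ancestor of later nodes
            i += 1
        else:
            # Every node left on the stack contains d: emit bottom to top.
            out.extend((s.name, d.name) for s in stack)
            j += 1
    return out

A = [Node("a1", 1, 1, 18), Node("a2", 1, 4, 15), Node("a3", 1, 7, 12)]
D = [Node("d1", 1, 2, 3), Node("d2", 1, 5, 6), Node("d3", 1, 8, 9),
     Node("d4", 1, 10, 11), Node("d5", 1, 13, 14), Node("d6", 1, 16, 17)]
desc_pairs = stack_tree_desc(A, D)
print(desc_pairs)
```

Each input node is pushed and popped at most once, so the work is linear in the input plus output size, in contrast to the rescanning done by the tree-merge algorithms.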

Stack-Tree-Desc (output sorted by descendants)

Stack-Tree-Anc
- Output ordered by ancestors
- Basic problem: results for a particular descendant cannot be output immediately, because later descendants may match an earlier ancestor
- Solution: keep lists of matching descendant nodes with each stack node
  ➢ Self-list: descendants that match this node; each descendant node is added to the self-lists of all matching ancestor nodes
  ➢ Inherit-list: results inherited from descendants already popped from the stack, to be output after the self-list matches are output

Example
- AList: a1, a2, a3
- DList: d1, d2, d3, d4, d5, d6
- Self-lists while scanning:
  ➢ a1: (a1,d1) (a1,d2) (a1,d3) (a1,d4) (a1,d5) (a1,d6)
  ➢ a2: (a2,d2) (a2,d3) (a2,d4) (a2,d5)
  ➢ a3: (a3,d3) (a3,d4)
- Pop a3: inherit-list of a2 becomes (a3,d3) (a3,d4)
- Pop a2: inherit-list of a1 becomes (a2,d2) | (a2,d3) | (a2,d4) | (a2,d5) | (a3,d3) | (a3,d4)
- OUTPUT: (a1,d1), (a1,d2), (a1,d3), (a1,d4), (a1,d5), (a1,d6), (a2,d2), (a2,d3), (a2,d4), (a2,d5), (a3,d3), (a3,d4)
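The self-list and inherit-list bookkeeping can be sketched in the same style. Node positions are the same assumed values as in the descendant-ordered example, and Python lists stand in for whatever list structure an actual implementation would use (extending a Python list copies, so constant-time concatenation is not modelled here).

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    doc: int
    start: int
    end: int

def stack_tree_anc(alist, dlist):
    """Both lists sorted by (doc, start). Output is ordered by ancestor."""
    out = []
    stack = []  # entries: [node, self_list, inherit_list]

    def pop():
        node, self_list, inherit = stack.pop()
        if stack:
            # Not the bottom entry: hand all results to the new top.
            stack[-1][2].extend(self_list + inherit)
        else:
            # Bottom entry: nothing earlier can still produce output.
            out.extend(self_list + inherit)

    i = j = 0
    while j < len(dlist):
        d = dlist[j]
        take_a = i < len(alist) and (alist[i].doc, alist[i].start) < (d.doc, d.start)
        nxt = alist[i] if take_a else d
        while stack and (stack[-1][0].doc, stack[-1][0].end) < (nxt.doc, nxt.start):
            pop()
        if take_a:
            stack.append([nxt, [], []])
            i += 1
        else:
            for entry in stack:                       # add d to every self-list
                entry[1].append((entry[0].name, d.name))
            j += 1
    while stack:                                      # flush what is left
        pop()
    return out

A = [Node("a1", 1, 1, 18), Node("a2", 1, 4, 15), Node("a3", 1, 7, 12)]
D = [Node("d1", 1, 2, 3), Node("d2", 1, 5, 6), Node("d3", 1, 8, 9),
     Node("d4", 1, 10, 11), Node("d5", 1, 13, 14), Node("d6", 1, 16, 17)]
anc_pairs = stack_tree_anc(A, D)
print(anc_pairs)
```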

Algorithm

Experimental Evaluation
- The join algorithms were implemented in the TIMBER XML query engine
- TIMBER is a native XML query engine built on top of SHORE
- The data set consists of 6.3 million element nodes, 800 MB of XML documents in text format
- Results are averages over multiple runs (warm cache)

Queries Used
- QS1 to QS6 are simple structural relationship queries
- QC1 and QC2 are complex chain queries, evaluated in a pipeline

performance

contd..

Holistic Twig Joins: Optimal XML Pattern Matching
Authors: Nicolas Bruno, Nick Koudas, Divesh Srivastava
Source: ACM SIGMOD 2002, June 4-6, Madison, Wisconsin, USA

Introduction
- XML query: matching XML data against a tree-structured pattern
- Previous approaches decompose the query into small pieces and solve them separately
  ➢ Choosing the order is a complex optimization problem, e.g.
    - ((book title) XML) (year 2000)
    - (((book year) 2000) title) XML
    - many other possibilities…
  ➢ Intermediate results can be large
- This paper proposes a novel holistic twig join approach for matching XML query twig patterns, in which no large intermediate results are created

Twig Pattern
- Query twig patterns (figure): an author node with children fn = 'jane' and ln = 'doe'
- Problem: given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D

Holistic Join
- Two algorithms: PathStack and TwigStack
- The approach uses a chain of linked stacks to compactly represent partial results for individual root-to-leaf query paths
- Each node q in the query has associated:
  ➢ A stream T_q, with the positions of the elements corresponding to node q, in increasing "left" order
  ➢ A stack S_q, with a compact encoding of partial solutions (chained)
- Figure panels: XML fragment, Query, Matches, Stacks

PathStack: Holistic Path Queries
- Repeatedly constructs stack encodings of partial solutions by iterating through the streams T_q
- Stacks encode the set of partial solutions from the current element in T_q to the root of the XML tree

  WHILE (!eof)
    qN = getMin(q)
    clean stacks
    push T_qN's first element to S_qN
    IF qN is a leaf node, expand solutions
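The loop above can be fleshed out for the simplest case. The following is a sketch under stated assumptions, not the paper's implementation: it handles only a linear path whose edges are all ancestor-descendant (q0//q1//...//qn), a single document, and elements given as hypothetical (name, left, right) tuples. It also skips pushing an element whose parent stack is empty, a small shortcut that does not change the result, and on ties in left position it takes the deeper query node first.

```python
from typing import NamedTuple

class Elem(NamedTuple):
    name: str
    left: int
    right: int

def path_stack(streams):
    """streams[i]: elements for query node i (root first), sorted by left."""
    n = len(streams)
    heads = [0] * n
    # Each stack entry: (element, index of the parent stack's top at push time)
    stacks = [[] for _ in range(n)]
    results = []

    def expand(q, limit, suffix):
        # Entries 0..limit of stack q are all ancestors of the element that follows.
        for elem, ptr in stacks[q][: limit + 1]:
            if q == 0:
                results.append((elem.name,) + suffix)
            else:
                expand(q - 1, ptr, (elem.name,) + suffix)

    while heads[-1] < len(streams[-1]):              # until the leaf stream ends
        live = [q for q in range(n) if heads[q] < len(streams[q])]
        # getMin: stream whose next element has the smallest left position
        qmin = min(live, key=lambda q: (streams[q][heads[q]].left, -q))
        elem = streams[qmin][heads[qmin]]
        heads[qmin] += 1
        for stack in stacks:                         # clean stacks
            while stack and stack[-1][0].right < elem.left:
                stack.pop()
        if qmin > 0 and not stacks[qmin - 1]:
            continue                                 # no possible ancestor on the parent stack
        ptr = len(stacks[qmin - 1]) - 1 if qmin > 0 else -1
        stacks[qmin].append((elem, ptr))
        if qmin == n - 1:                            # leaf: expand solutions, then pop
            expand(qmin, len(stacks[qmin]) - 1, ())
            stacks[qmin].pop()
    return results

# Made-up data: a1 > b1 > a2 > b2 > {c1, c2}; query a//b//c
Ta = [Elem("a1", 1, 12), Elem("a2", 3, 10)]
Tb = [Elem("b1", 2, 11), Elem("b2", 4, 9)]
Tc = [Elem("c1", 5, 6), Elem("c2", 7, 8)]
path_matches = path_stack([Ta, Tb, Tc])
print(path_matches)
```

The stacks hold four entries in total at any time, yet together they encode six matches; that compact encoding is the point of the linked-stack representation.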

PathStack Example

Twig Queries
- Naive adaptation of PathStack:
  ➢ Solve each root-to-leaf path independently
  ➢ Merge-join the intermediate results
- Problem: many intermediate results might not be part of the final answer

TwigStack
- Computes only partial solutions that are guaranteed to extend to a final solution
- Before pushing a node N onto stack S_N, it ensures that N has a descendant in each of the streams T_n for n ∈ children(N)
- Merges partial solutions to obtain all matches

Example (figure): twig pattern with an author node and children fn = 'jane' and ln = 'doe'

Questions ? Thank You

ORDPATHs: Insert-Friendly XML Node Labels
Authors: Patrick O’Neil, Elizabeth O’Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, Nigel Westbury
Source: ACM SIGMOD 2004, June 13–18, 2004, Paris, France

Labelling Schemes for XML Trees
- Global order: each node is assigned a number that represents the node’s absolute position in the document
- Dewey order: each node is assigned a vector that represents the path from the document’s root to the node

Problems
- Both schemes work well for static XML data
- Poor performance for arbitrary insertions and deletions
- Relabelling of many nodes is necessary
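A small sketch of Dewey labels shows both why they are convenient and why insertions are costly. Labels are modelled as Python tuples, and the helper names and sample tree are assumptions for illustration: ancestry is a prefix test, but inserting a new first child forces every following sibling and all of their descendants to be relabelled.

```python
def is_ancestor(a, d):
    """Dewey ancestor test: a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

def insert_child(labels, parent, position):
    """Insert a new child of `parent` at 1-based `position`.
    Returns (new sorted labels, number of existing labels that changed)."""
    depth = len(parent)
    new_labels, changed = [], 0
    for lab in labels:
        # Siblings at or after the position, and everything below them, shift by one.
        if len(lab) > depth and lab[:depth] == parent and lab[depth] >= position:
            lab = lab[:depth] + (lab[depth] + 1,) + lab[depth + 1:]
            changed += 1
        new_labels.append(lab)
    new_labels.append(parent + (position,))
    return sorted(new_labels), changed

# Made-up tree: root 1 with children 1.1 (child 1.1.1), 1.2 (child 1.2.1), 1.3
labels = [(1,), (1, 1), (1, 1, 1), (1, 2), (1, 2, 1), (1, 3)]
new_labels, changed = insert_child(labels, (1,), 1)
print(new_labels, changed)
```

Tuple comparison in Python is lexicographic, so sorting the labels also yields document order.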

ORDPATHs
- Hierarchical labelling scheme, like Dewey
- Provides efficient structural modification of XML data
- Initial labels: root 1 with children 1.1, 1.3, 1.5
- Insert nodes at the left and right ends: root 1 with children 1.-1, 1.1, 1.3, 1.5, 1.7

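More Insertion Examples
- Initial labels: root 1 with children 1.1, 1.3, 1.5
- Arbitrary insert between 1.1 and 1.3: root 1 with children 1.1, 1.2.1, 1.3, 1.5
- Arbitrary insert between 1.1 and 1.2.1: root 1 with children 1.1, 1.2.-1, 1.2.1, 1.3, 1.5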
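The labels in these examples can be checked with a few lines of code. The sketch below models a label as a tuple of integers, which is an assumption for illustration (the helper names are hypothetical, and no binary encoding is modelled). Odd components count as levels, even components act only as insertion "carets", and plain tuple comparison gives document order.

```python
def parse(label):
    """'1.2.-1' -> (1, 2, -1)"""
    return tuple(int(part) for part in label.split("."))

def fmt(label):
    return ".".join(str(c) for c in label)

def level(label):
    """Only odd components contribute to depth; even ones are carets."""
    return sum(1 for c in label if c % 2 != 0)

def parent(label):
    """Drop the last (odd) component, then any trailing even carets."""
    rest = label[:-1]
    while rest and rest[-1] % 2 == 0:
        rest = rest[:-1]
    return rest

# All child labels that appear in the two insertion examples.
children = [parse(s) for s in ["1.1", "1.3", "1.5", "1.-1", "1.7", "1.2.1", "1.2.-1"]]
in_order = [fmt(c) for c in sorted(children)]
print(in_order)
print({fmt(c): (level(c), fmt(parent(c))) for c in children})
```

No existing label changes in either example: the inserted nodes receive fresh labels that sort into the right place, and they remain children of node 1 at the same level as their siblings.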
