1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min.


1 1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min

2 2 Contents. Introduction; Background; Holistic Path Join Algorithms; Twig Join Algorithms; Experimental Evaluation; Conclusion.

3 3 Introduction. XML is the de facto standard for data exchange and retrieval. Its data model is tree-structured.

4 4 Introduction. XML query languages specify patterns of selection predicates on multiple elements that stand in a specified tree-structured relationship. Example: book[title = 'XML']//author[fn = 'jane' AND ln = 'doe']

5 5 Introduction. Finding all occurrences of a twig pattern in a database is a core operation. Previous work: decompose the twig pattern into a set of binary (parent-child and ancestor-descendant) relationships; match each of the binary relationships; "stitch" these basic matches together.

6 6 Introduction. Contributions: two families of holistic join algorithms (the holistic path join approach and the holistic twig join approach); an experimental study.

7 7 Background. XML Data Model: an XML database is a forest of rooted, ordered, labeled trees.

8 8 Background. Indexing XML Documents: element positions are represented as tuples (DocID, Left:Right, Level), sorted by Left. Child and descendant relationships between elements are easily determined. [Figure: position lists per label. author: (1,6:20,3), …; book: (1,1:150,1), …; jane: (1,8:8,5), …, (1,43:43,5), …; title: (1,2:4,2), (1,65:67,3), …; XML: (1,3:3,3), (1,66:66,4), …; year: (1,61:63,2), …]
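
A minimal sketch of why this encoding makes structural relationships cheap to test: an element is an ancestor of another when its Left:Right interval strictly contains the other's, and a parent when, in addition, the levels differ by one. The names Pos, is_ancestor and is_parent are illustrative and not from the slides; the sample positions come from the slide's figure, and which label each tuple belongs to is my reading of that figure.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Element position (DocID, Left:Right, Level) as on the slide."""
    doc: int
    left: int
    right: int
    level: int


def is_ancestor(a: Pos, d: Pos) -> bool:
    # a is an ancestor of d if both are in the same document and
    # a's interval strictly contains d's interval.
    return a.doc == d.doc and a.left < d.left and d.right < a.right


def is_parent(a: Pos, d: Pos) -> bool:
    # Parent-child is ancestor-descendant with exactly one level of difference.
    return is_ancestor(a, d) and d.level == a.level + 1


# Sample positions taken from the slide's figure (label assignment assumed).
book = Pos(1, 1, 150, 1)
title = Pos(1, 2, 4, 2)
author = Pos(1, 6, 20, 3)
jane = Pos(1, 8, 8, 5)

print(is_ancestor(book, jane))     # book contains jane
print(is_parent(book, title))      # title is a direct child of book
print(is_parent(book, jane))       # jane is a descendant but not a child
print(is_ancestor(author, title))  # title lies outside the author interval
```

Both tests are constant-time comparisons, which is what lets the join algorithms later in the deck work on sorted position streams instead of navigating the tree.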

9 9 Background. Twig Pattern Matching: given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D. Example: book[title = 'XML' AND year = '2000']

10 10 Background. Previous attempts are based on binary joins: decompose the query into binary relationships; solve the binary joins against the XML database; combine the "basic" matches. Main drawbacks: join-order optimization is required; intermediate results can be large. Example: book[title = 'XML' AND year = '2000'] can be evaluated as ((book JOIN title) JOIN XML) JOIN (year JOIN 2000), as (((book JOIN year) JOIN 2000) JOIN title) JOIN XML, or in many other ways.

11 11 Holistic Joins. Solve the entire twig query in two phases: (1) produce "guaranteed" partial results in one pass; (2) combine (merge-join) the partial results. Partial results are smaller than the final result. Partial results are encoded compactly.

12 12 Data Structure. Each node q in the query has associated with it: a stream T_q, holding the positions of the elements that correspond to node q, in increasing "left" order; and a stack S_q, holding a compact encoding of partial solutions (the stacks are chained). Each stack entry is a pair (position, pointer to an entry in S_parent(q)).

13 13 Data Structure: Result representation. The nodes in a stack S_q all lie on one root-to-leaf path of the XML tree. [Figure: XML fragment, query //A//C//D, the resulting stacks, and the matches they encode]

14 14 Path Stack: Holistic Path Queries. Repeatedly constructs stack encodings of partial solutions by iterating through the streams T_q. The stacks encode the set of partial solutions from the current element in T_q to the root of the XML tree. WHILE (!eof): qN = getMin(q); clean the stacks; move the first element of T_qN onto S_qN (advancing the stream), with a pointer to top(S_parent(qN)); IF qN is a leaf node, expand solutions.
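
The loop above can be sketched in Python for ancestor-descendant path queries such as //A//C//D. This is an illustration under stated assumptions, not the authors' implementation: positions are reduced to (left, right) pairs within one document, query nodes are assumed to carry distinct labels, the loop stops once the leaf stream is exhausted, and the names path_stack and expand as well as the sample document are made up for the example.

```python
def path_stack(streams):
    """streams[i] holds the (left, right) positions for query node i,
    sorted by left; node 0 is the query root, the last node is the leaf."""
    n = len(streams)
    leaf = n - 1
    cursor = [0] * n                 # next unread element in each stream
    stacks = [[] for _ in range(n)]  # entries: (left, right, parent_ptr)
    matches = []

    def expand(level, idx, suffix):
        # Walk the chained stacks upward: an entry pairs with every entry
        # at or below its pointer in the parent stack.
        left, right, ptr = stacks[level][idx]
        match = [(left, right)] + suffix
        if level == 0:
            matches.append(tuple(match))
            return
        for j in range(ptr + 1):
            expand(level - 1, j, match)

    while cursor[leaf] < len(streams[leaf]):
        # getMin: the stream whose next element has the smallest left.
        live = [i for i in range(n) if cursor[i] < len(streams[i])]
        q = min(live, key=lambda i: streams[i][cursor[i]][0])
        left, right = streams[q][cursor[q]]
        # Clean stacks: pop entries that end before the new element starts.
        for stack in stacks:
            while stack and stack[-1][1] < left:
                stack.pop()
        # Push with a pointer to the current top of the parent stack.
        ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
        stacks[q].append((left, right, ptr))
        cursor[q] += 1
        if q == leaf:
            expand(leaf, len(stacks[leaf]) - 1, [])
            stacks[leaf].pop()
    return matches


# Hypothetical document: A(1,12) > C(2,11) > A(3,10) > C(4,9) > D(5,6), D(7,8)
a_stream = [(1, 12), (3, 10)]
c_stream = [(2, 11), (4, 9)]
d_stream = [(5, 6), (7, 8)]
matches = path_stack([a_stream, c_stream, d_stream])
for m in matches:
    print(m)
```

The stacks never hold more than one root-to-leaf chain per query node, yet the pointers let expand enumerate every combination, which is the compact encoding the previous slides refer to.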

15 15 Path Stack Example

16 16 Twig Queries. Naive adaptation of PathStack: solve each root-to-leaf path independently, then merge-join the intermediate results. Problem: many intermediate results might not be part of the final answer.
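
A small sketch of the problem, assuming a hypothetical document with three author elements, of which only one has both an fn and an ln descendant, and the twig author[fn AND ln]. For brevity the path solutions are computed with a nested loop rather than PathStack, and the final join is a simple equality join on the shared author node; the names contains and path_solutions are illustrative.

```python
def contains(a, d):
    # Interval containment on (left, right) positions.
    return a[0] < d[0] and d[1] < a[1]


def path_solutions(ancestors, descendants):
    # All ancestor//descendant pairs for one root-to-leaf path.
    return [(a, d) for a in ancestors for d in descendants if contains(a, d)]


# Hypothetical position streams for the twig author[fn AND ln].
authors = [(2, 5), (6, 11), (12, 15)]
fns = [(3, 4), (7, 8)]
lns = [(9, 10), (13, 14)]

fn_paths = path_solutions(authors, fns)  # solutions of author//fn
ln_paths = path_solutions(authors, lns)  # solutions of author//ln

# Join the two sets of path solutions on the shared author node.
twig = [(a, f, l) for a, f in fn_paths for a2, l in ln_paths if a == a2]

print(len(fn_paths) + len(ln_paths), "path solutions,", len(twig), "twig match")
```

Here half of the path solutions are computed only to be thrown away at the join, and the waste grows with the number of elements that satisfy one branch of the twig but not the others.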

17 17 Twig Stack. (1) Compute only partial solutions that are guaranteed to extend to a final solution. (2) Merge the partial solutions to obtain all matches. WHILE (!eof): qN = getNext(q); clean the stacks; IF the first element of T_qN is part of a solution, push it; IF qN is a leaf node, expand solutions. getNext may advance the streams in subTree(q) past elements that are guaranteed not to be part of a solution.

18 18 Twig Stack. The key difference between PathStack and TwigStack: before a node h_q from T_q is pushed onto its stack S_q, TwigStack ensures that (1) h_q has a descendant h_qi in each of the streams T_qi, for qi ∈ children(q), and (2) each node h_qi recursively satisfies the first property.

19 19 Twig Stack Example. Before an author element is pushed onto the author stack, the current elements of all child streams (T_fn, T_ln) are checked. The partial results are (6,11)(7,8) and (6,11)(9,10), which are then merged to generate the final result. [Figure: document tree rooted at allauthors (1,16) with author, fn and ln descendants, and the query author[fn AND ln]; streams: author (2,5) (6,11) (12,15); fn and ln (3,4) (7,8) (9,10) (13,14)]
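
The condition TwigStack checks before a push can be restated as a brute-force predicate. The sketch below is not getNext, which works incrementally on the current stream heads; it only spells out the property "has a descendant in every child stream, recursively" over complete streams. The positions are those of the example above; assigning (3,4) to fn and (13,14) to ln is an assumption, since the figure does not survive in the transcript, and has_extension is an illustrative name.

```python
def contains(a, d):
    # Interval containment on (left, right) positions.
    return a[0] < d[0] and d[1] < a[1]


def has_extension(q, elem, children, streams):
    # elem (from the stream of query node q) qualifies if every child query
    # node has some descendant of elem that itself qualifies.
    return all(
        any(contains(elem, d) and has_extension(c, d, children, streams)
            for d in streams[c])
        for c in children[q]
    )


# Query twig author[fn AND ln].
children = {"author": ["fn", "ln"], "fn": [], "ln": []}
streams = {
    "author": [(2, 5), (6, 11), (12, 15)],
    "fn": [(3, 4), (7, 8)],    # (3,4) as fn is an assumption
    "ln": [(9, 10), (13, 14)], # (13,14) as ln is an assumption
}

pushable = [a for a in streams["author"]
            if has_extension("author", a, children, streams)]
print(pushable)
```

Only the author at (6,11) passes, so it is the only author element that reaches the stack, and the two partial results it produces are exactly the ones listed on the slide.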

20 20 Experiment Environment. All algorithms were implemented in C++, using the file system as a simple storage engine. Synthetic database: random XML documents parameterized by depth, fan-out, and number of distinct labels. Techniques compared: binary join techniques, PathStack, TwigStack.

21 21 PathStack vs. Binary Joins. Sequential scan: 1.87 s. PathStack: 2.53 s. Binary joins: 16.1 s to 53.07 s. XML database fragment: 1 million nodes. Path query: A1//A2//A3//A4//A5//A6

22 22 PathStack vs. TwigStack. Query: [Figure: twig query over nodes A1 to A7]. Data: a full ternary tree. The first subtree contains only A1, A2, A3 and A4 nodes; the second subtree contains A1, A5, A6 and A7 nodes; the third subtree contains all possible nodes. The size of the third subtree, relative to the size of the first two, is varied from 8% to 24%.

23 23 PathStack vs. TwigStack Partial solutions are discarded at the merge step

24 24 Conclusion. Developed holistic path join algorithms. Developed TwigStack, which generalizes PathStack to twig queries. The holistic approach performs better than the binary join approach. Future work: integrate TwigStack with value-based joins (id-refs, user-defined predicates, etc.); incorporate the remaining axes (following, etc.).

