On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling.

1 On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques Ting Chen, Jiaheng Lu, Tok Wang Ling

2 Outline: Background; XML twig pattern query; Previous twig join algorithms; Limit of the original holistic method TwigStack; Our holistic twig pattern matching algorithms; Two refined indexing schemes: Tag+Level and PPS; A generalized holistic matching theory; iTwigJoin: a generalized holistic matching algorithm; Experiments; Conclusion

3 Background: XML and Region Coding. An XML document is modeled as a tree in our work. Region coding assigns a label to each element of the XML document tree. Containment property: a.start < b.start and b.end < a.end if and only if a is an ancestor of b.
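
The containment property can be illustrated with a short Python sketch. It is not taken from the slides: the `Region` tuple, the `level` field (added here so that parent-child edges can be checked as well), and the counter-based labelling in `label_tree` are assumptions made for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int   # counter value when the element opens
    end: int     # counter value when the element closes
    level: int   # depth in the tree, root = 1 (assumed field)


def label_tree(tree):
    """Assign a region label to every node of a (tag, children) tree.

    Returns a list of (tag, Region) pairs in document (preorder) order.
    """
    out = []
    pos = 0

    def visit(node, level):
        nonlocal pos
        tag, children = node
        pos += 1
        start = pos
        idx = len(out)
        out.append(None)            # reserve the preorder slot
        for child in children:
            visit(child, level + 1)
        pos += 1
        out[idx] = (tag, Region(start, pos, level))

    visit(tree, 1)
    return out


def is_ancestor(a, b):
    """Containment property: a's region strictly encloses b's region."""
    return a.start < b.start and b.end < a.end


def is_parent(a, b):
    """Parent-child additionally requires adjacent levels."""
    return is_ancestor(a, b) and a.level + 1 == b.level


# Hypothetical document: <a><b><c/></b><c/></a>
LABELS = label_tree(("a", [("b", [("c", [])]), ("c", [])]))
```

With this labelling, an ancestor test costs two integer comparisons and never walks the tree, which is what lets join algorithms work on sorted element streams.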

4 Background: XML twig pattern queries. An XML twig query is a small tree whose edges represent parent-child or ancestor-descendant relationships. Given an XML document D and an XML twig query Q, our problem is to find all occurrences of Q in D.
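
To make the problem statement concrete, here is a brute-force Python sketch that enumerates all occurrences of a twig query in a region-coded document. It is a reference definition only, not one of the join algorithms discussed in the talk; the `Elem` tuple, the nested-tuple query encoding, and the sample document are assumptions.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    name: str
    tag: str
    start: int
    end: int
    level: int


def related(anc, desc, axis):
    """Check an edge: '//' is ancestor-descendant, '/' is parent-child."""
    inside = anc.start < desc.start and desc.end < anc.end
    if axis == "//":
        return inside
    return inside and anc.level + 1 == desc.level


def embeddings(query, anchor, elements):
    """All matches of the sub-twig `query` whose root is bound to `anchor`.

    A query is (tag, [(axis, subquery), ...]). Each match is a tuple of
    element names listed in preorder of the query nodes.
    """
    _, branches = query
    results = [(anchor.name,)]
    for axis, sub in branches:
        sub_matches = [
            m
            for cand in elements
            if cand.tag == sub[0] and related(anchor, cand, axis)
            for m in embeddings(sub, cand, elements)
        ]
        results = [r + m for r in results for m in sub_matches]
    return results


def twig_matches(query, elements):
    """Enumerate every occurrence of the twig query in the document."""
    return [
        m
        for e in elements
        if e.tag == query[0]
        for m in embeddings(query, e, elements)
    ]


# Hypothetical document: a1 contains b1 (which contains c1) and c2.
DOC = [
    Elem("a1", "A", 1, 8, 1),
    Elem("b1", "B", 2, 5, 2),
    Elem("c1", "C", 3, 4, 3),
    Elem("c2", "C", 6, 7, 2),
]
AD_QUERY = ("A", [("//", ("C", [])), ("//", ("B", []))])  # A[.//C]//B
PC_QUERY = ("A", [("/", ("C", [])), ("/", ("B", []))])    # A[C]/B
```

The sketch tests every candidate element against every query edge, so its cost grows quickly with document size. Avoiding that blow-up, and the useless intermediate results that come with it, is what the join algorithms in this talk are about.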

5 Previous XML Twig Join Algorithms. Edge based: Binary Structural Join [Al-Khalifa et al ICDE02], Join Order Selection [Wu et al ICDE03]. Path based: BLAS [Chen et al SIGMOD04]. Tree (holistic) based: TwigStack [Bruno et al SIGMOD02], TwigStackList [Lu et al CIKM04]. Index based: B tree [Chien et al VLDB02], XR tree [Jiang et al ICDE02], TSGeneric+ [Jiang et al VLDB03].

6 Holistic Twig Matching. TwigStack [Bruno et al SIGMOD02] is a holistic twig join algorithm. E.g., for query A[.//C]//B there may be many matches to A//B alone, but TwigStack only outputs results for A elements that have both B and C descendants. No join order selection is required. TwigStack is optimal only for twig patterns whose edges are all ancestor-descendant. Reordering of elements in a stream does not help [Choi et al DEXA03].

7 Sub-optimality of TwigStack. TwigStack is not optimal for twigs with parent-child edges. [Figure: a query with root A and branches B and C; a document tree over elements a1..an, b1..bn, c1..cn; and the tag streams a1, a2, ..., an; b1, b2, ..., bn; c1, c2, ..., cn.]

8 Two Refined Streaming Schemes (1). To enlarge the optimal class of TwigStack, in our paper we propose two refined streaming schemes. Tag+Level: elements with the same tag and level are grouped together. [Figure: the query with root A and branches B and C, and the document over a1..an, b1..bn, c1..cn partitioned into Tag+Level streams.]
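
The difference between plain tag streaming and Tag+Level streaming can be sketched in a few lines of Python. The element tuples and the sample document are assumptions for illustration, and each stream is kept in document order (ascending start position), as is usual for region-coded streams.

```python
from collections import defaultdict

# Hypothetical elements: (name, tag, start, level).
ELEMENTS = [
    ("a1", "a", 1, 1),
    ("b1", "b", 2, 2),
    ("a2", "a", 4, 2),
    ("b2", "b", 5, 3),
    ("c1", "c", 7, 3),
]


def tag_streams(elements):
    """Tag streaming: one stream per tag."""
    streams = defaultdict(list)
    for name, tag, _start, _level in sorted(elements, key=lambda e: e[2]):
        streams[tag].append(name)
    return dict(streams)


def tag_level_streams(elements):
    """Tag+Level streaming: one stream per (tag, level) pair."""
    streams = defaultdict(list)
    for name, tag, _start, level in sorted(elements, key=lambda e: e[2]):
        streams[(tag, level)].append(name)
    return dict(streams)
```

In this sample, `a1` and `a2` share one stream under tag streaming but fall into separate streams under Tag+Level. A parent-child edge can then be restricted to pairs of streams whose levels differ by exactly one.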

9 Two Refined Streaming Schemes (1). For this query, the Tag+Level streaming scheme can guarantee optimality. [Figure: the same query, document and Tag+Level streams as on the previous slide.]

10 Two Refined Streaming Schemes (1). But given a more complex query and document, Tag+Level cannot guarantee optimality. [Figure: an example query over tags A, B, C, D and a document with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1, shown with its Tag+Level streams; d2 and d3 share one stream.]

11 Two Refined Streaming Schemes (2). Prefix Path Streaming (PPS): elements with the same root-to-node path are grouped together. Every element in the document is stored as an individual stream in this example. [Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1 and its PPS streams.]
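
A minimal Python sketch of the PPS grouping rule follows. The nested `(name, tag, children)` tree encoding and the sample document are assumptions; only the rule itself (group by root-to-node tag path) comes from the slide.

```python
def prefix_path_streams(tree):
    """Group elements by their root-to-node tag path.

    `tree` is a nested (name, tag, children) tuple. A preorder traversal
    visits elements in document order, so every stream stays sorted.
    """
    streams = {}

    def visit(node, prefix):
        name, tag, children = node
        path = prefix + "/" + tag
        streams.setdefault(path, []).append(name)
        for child in children:
            visit(child, path)

    visit(tree, "")
    return streams


# Hypothetical document: a1 has children b1 (with c1), b2 (with c2) and c3.
DOC_TREE = (
    "a1", "a", [
        ("b1", "b", [("c1", "c", [])]),
        ("b2", "b", [("c2", "c", [])]),
        ("c3", "c", []),
    ],
)
```

Here `c1` and `c2` share the stream `/a/b/c`, while `c3` gets its own stream `/a/c` even though all three have tag `c`. Under Tag+Level, `c1` and `c2` would also be separated from `c3`, but only because they sit at a different level.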

12 Two Refined Streaming Schemes (2). PPS is optimal for the following example: d1, d2, c1 and c2 are separated into different streams. [Figure: the query over tags A, B, C, D and the document with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1, shown with its PPS streams.]

13 Two Refined Streaming Schemes (2). A natural question: can PPS be guaranteed to be optimal for all queries and data?

14 Two Refined Streaming Schemes (2). A natural question: can PPS be guaranteed to be optimal for all queries and data? The answer is NO. For example, c1 and c2 are in the same stream; similarly, e1 and e2 are in the same stream. [Figure: a query over tags A, B, C, D, E and a document with elements a1..a4, b1..b5, c1, c2, d1, d2, e1, e2; head elements are marked.]

15 A general algorithm: iTwigJoin. We propose a general algorithm, called iTwigJoin, which can be used with various data streaming schemes. Our key idea is to classify all current head elements into three classes: subtree-matching, useless, and blocked.

16 Classifying Head Elements. Subtree-matching element: an element e of tag E is called a subtree-matching element for query Q if e is in a match to Q_E (Q_E is the sub-tree of Q rooted at E) and NOT in any future match to Q_P, where P is the parent of E in Q. Useless element: an element e is called a useless element if e is not in any future match to Q_E. Blocked element: an element which is neither subtree-matching nor useless.

17 Example: Classifying Head Elements (Tag+Level Streaming). [Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1; its Tag+Level streams with head elements marked; and query Q1 over tags A, B, C, D.]

18 Example: Classifying Head Elements (Tag+Level Streaming). Q1 so far: blocked: d1.

19 Example: Classifying Head Elements (Tag+Level Streaming). Q1 so far: blocked: d1, c1.

20 Example: Classifying Head Elements (Tag+Level Streaming). Q1: subtree-matching: none; useless: none; blocked: d1, c1, a1, a2, b2, b1.

21 Example: Classifying Head Elements (Tag+Level Streaming). Q1: subtree-matching: none; useless: none; blocked: d1, c1, a1, a2, b2, b1. [Figure: a second query Q2 over tags A, B, C, D, with an empty classification table.]

22 Example: Classifying Head Elements (Tag+Level Streaming). Q1: subtree-matching: none; useless: none; blocked: d1, c1, a1, a2, b2, b1. Q2 so far: subtree-matching: d1.

23 Example: Classifying Head Elements (Tag+Level Streaming). Q1: subtree-matching: none; useless: none; blocked: d1, c1, a1, a2, b2, b1. Q2 so far: subtree-matching: d1; useless: a1, b2.

24 Example: Classifying Head Elements (Tag+Level Streaming). Q1: subtree-matching: none; useless: none; blocked: d1, c1, a1, a2, b2, b1. Q2 so far: subtree-matching: d1; useless: a1, b2; blocked: c1.

25 Example: Classifying Head Elements (Tag+Level Streaming). Q1: subtree-matching: none; useless: none; blocked: d1, c1, a1, a2, b2, b1. Q2: subtree-matching: d1; useless: a1, b2; blocked: c1, b1, a2.

26 Example: Classifying Head Elements (Tag+Level Streaming). Q1: subtree-matching: none; useless: none; blocked: a1, a2, b1, b2, c1, d1. Q2: subtree-matching: d1; useless: a1, b2; blocked: a2, b1, c1. A useless element can be discarded safely. A subtree-matching element is pushed onto the corresponding stack. A blocked element causes problems: it CANNOT be discarded because that may cause loss of results, and it CANNOT be pushed onto the stack because that may cause useless results. When all head elements are blocked, optimal holistic matching CANNOT be guaranteed.

27 iTwigJoin. In our algorithm, in order to output all correct answers, we push blocked elements onto the stack, which may result in useless intermediate results in some cases. [Figure: the document with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1 and query Q1 over tags A, B, C, D, under Tag+Level streaming.]

28 iTwigJoin. Since all head elements are blocked, we have to push a1 onto the stack and output one path solution (a1, d1).

29 iTwigJoin. If there is no c2, then (a1, d1) is a useless path solution.

30 iTwigJoin. Two main components. Stream Manager: controls the advance operation of streams and sends elements to temporary storage. Temporary Storage: pushes elements onto the stacks (S_A, S_B, S_C) and outputs intermediate paths. [Figure: the Stream Manager reading the element streams and feeding the Temporary Storage stacks S_A, S_B, S_C.]

31 Flowchart of iTwigJoin. While not all streams have ended: label the current head elements as subtree-matching, useless or blocked. If a useless element is found, discard it. Otherwise select a subtree-matching or blocked element e, pop some elements from the stack, push e onto the stack, and output intermediate paths if e is a leaf.

32 Optimal classes of iTwigJoin for three streaming schemes (streaming scheme: optimal class). Tag Streaming: A-D only.

33 Optimal classes of iTwigJoin for three streaming schemes (streaming scheme: optimal class). Tag Streaming: A-D only. Tag+Level Streaming: A-D/P-C only.

34 Optimal classes of iTwigJoin for three streaming schemes (streaming scheme: optimal class). Tag Streaming: A-D only. Tag+Level Streaming: A-D/P-C only. Prefix Path Streaming: A-D/P-C only or 1-branch node.

35 Optimal classes of iTwigJoin for three streaming schemes (streaming scheme: optimal class). Tag Streaming: A-D only. Tag+Level Streaming: A-D/P-C only. Prefix Path Streaming: A-D/P-C only or 1-branch node. The more refined the streaming scheme, the larger the optimal class.

36 Experiments. Benchmarks: XMark (synthetic data); TreeBank (real data from the Wall Street Journal).

37 Experiments: I/O Performance. Queries: Tree1: A-D only; Tree2: P-C only; Tree3: P-C only; Tree4: 1-branch node; Tree5: 1-branch node. By pruning irrelevant streams, PPS usually scans the fewest elements.
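
One way to picture stream pruning under PPS is sketched below in Python. The slides do not say how pruning is implemented, so the approach here is an assumption: each query node's root-to-node path is written as a list of (axis, tag) steps, and only the PPS streams whose label path matches that pattern are kept. The `path_regex` and `useful_streams` helpers and the sample stream labels are hypothetical.

```python
import re


def path_regex(steps):
    """Compile a query path, given as (axis, tag) steps, into a regex.

    '/' requires the tag at the very next level; '//' allows any number
    of intermediate tags before it.
    """
    parts = []
    for axis, tag in steps:
        if axis == "//":
            parts.append(r"(?:/[^/]+)*")
        parts.append("/" + re.escape(tag))
    return re.compile("".join(parts))


def useful_streams(steps, stream_paths):
    """Keep only the stream labels that the query path can match."""
    pattern = path_regex(steps)
    return [p for p in stream_paths if pattern.fullmatch(p)]


# Hypothetical PPS stream labels (root-to-node tag paths).
STREAMS = ["/a/b", "/a/x/b", "/x/a/b", "/a/b/c"]
```

For the path `//a/b` the stream `/a/x/b` is never opened at all, whereas tag streaming would scan every `b` element. This is one plausible reading of why PPS scans the fewest elements.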

38 Experiments: Number of Intermediate Paths. Queries: Tree1: A-D only; Tree2: P-C only; Tree3: P-C only; Tree4: 1-branch node; Tree5: 1-branch node. For TreeBank query 5 there are no matching results, so Tag+Level and PPS do not output any intermediate results.

39 Experiments: Running Time. Queries: XMark1: path pattern; XMark2: A-D only; XMark3: P-C only; XMark4: 1-branch node; XMark5: non-optimal. Tag+Level and PPS perform better than TwigStack and TwigStackList on XMark data.

40 Experiments: Summary. Both PPS and Tag+Level help to reduce I/O costs, and PPS saves more. PPS may result in too many streams for deep XML data; Tag+Level seems to be a good compromise. PPS and Tag+Level completely avoided the output of redundant intermediate paths in all cases we tested, though they cannot guarantee optimality in theory.

41 Conclusions. We develop a general algorithm to perform holistic twig joins on the Tag+Level and PPS streaming schemes. We identify two I/O-optimal classes for the Tag+Level and PPS streaming schemes. Since our experiments show that the Tag+Level streaming scheme produces very few useless intermediate results in most cases, we recommend using the Tag+Level scheme for efficient XML twig pattern matching.

42 END. Thank you! Q & A

43 Backup: iTwigJoin Algorithm
While (not all streams end):
1. Label current head elements as either Matching, Useless or Blocked.
2. If any head element is Useless, discard it and continue.
3. Let e1 be the matching element with the smallest startPos; let e2 be the blocked element with the smallest endPos.
4. If e2.endPos < e1.startPos, let e be the blocked element with the smallest startPos; else let e be e1.
5. Advance the stream e belongs to.
6. Pop from e's stack the elements whose endPos < e.startPos.
7. Push e onto its stack if e has a parent/ancestor in the temporary storage system.
8. Output all paths involving e if the tag of e is a leaf node in Q.
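
Steps 3, 4 and 6 of the backup algorithm can be sketched in Python as follows. This is only a partial illustration under stated assumptions: the classification of head elements (step 1) is taken as given input, the `Head` tuple and both function names are hypothetical, and the behaviour when there are no matching or no blocked head elements is a guess, since the slide does not cover those cases.

```python
from typing import NamedTuple


class Head(NamedTuple):
    name: str
    start: int
    end: int


def choose_next(matching, blocked):
    """Steps 3-4: pick the head element to process next.

    Assumption: with no matching elements, fall back to the blocked
    element with the smallest startPos; with no blocked elements, take
    the matching element with the smallest startPos.
    """
    if not matching:
        return min(blocked, key=lambda h: h.start)
    e1 = min(matching, key=lambda h: h.start)
    if not blocked:
        return e1
    e2 = min(blocked, key=lambda h: h.end)
    if e2.end < e1.start:
        return min(blocked, key=lambda h: h.start)
    return e1


def clean_stack(stack, e):
    """Step 6: pop elements that end before e starts.

    Such elements cannot be ancestors of e or of any later element in
    the same stream, since streams are sorted by startPos.
    """
    while stack and stack[-1].end < e.start:
        stack.pop()
    return stack
```

For instance, with matching head `d1 = (5, 6)` and blocked heads `a1 = (1, 10)` and `c1 = (3, 4)`, the blocked element `c1` ends before `d1` starts, so the sketch selects the blocked element with the smallest start position, `a1`.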

