Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010.


1 Benchmarking Holistic Approaches to XML TPQ Processing Jiaheng Lu Renmin University of China BenchmarX 2010

2 A little bit of history (BenchmarX 10 Keynote, Jiaheng Lu: Benchmarking Holistic Approaches to TPQ Processing)
Database world: 1970 relational databases; 1990 object-oriented databases; 1995 semi-structured databases.
Document world: 1974 SGML (Standard Generalized Markup Language); 1990 HTML (Hypertext Markup Language); 1992 URL (Uniform Resource Locator); 1996 XML (eXtensible Markup Language).

3 What is XML
The eXtensible Markup Language (XML) is the universal format for structured documents and data on the Web.
Advantages of XML: a human- and machine-readable format; more flexible than HTML and less complicated than SGML; unlike relational tables, XML can describe tree- and graph-structured data.

4 What is XML
Basic specification: XML 1.0, W3C Recommendation, February 1998.
Example document (markup lost in extraction) with the text values "The politics of experience" and "Ronald Laing".

5 XML Tree
An XML document is commonly modeled as a rooted, ordered tree.
Figure: root r with children title ("The politics…") and author; author has children firstname ("Ronald") and lastname ("Laing"); year (1967) is an attribute.

6 XML query language
Major standards for querying XML data: XPath and XQuery.
XPath is a language for addressing parts of an XML document (XPath 1.0, W3C, Nov 1999). E.g. paper[title=XML]/author
XQuery is an XML query language that provides features for retrieving and interpreting information from XML documents (XQuery 1.0, Nov 2005).

7 An XQuery example
XQuery (element constructor tags lost in extraction): { for $b in doc("bib.xml")/bib//book, $t in $b/title, $a in $b/author return { $t } { $a } }
It creates a flat list of all the title-author pairs for every book in the bibliography.

8 XML Twig Pattern
The XML twig pattern query (TPQ) is a core operation in XPath and XQuery.
Definition: an XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either parent-child (P-C) or ancestor-descendant (A-D) relationships.

9 An XML twig pattern example
XQuery (element constructor tags lost in extraction): { for $b in doc("bib.xml")/bib//book, $t in $b/title, $a in $b/author return { $t } { $a } }
It creates a flat list of all the title-author pairs for every book in the bibliography.
To answer the XQuery, we first need to match the following XML twig pattern: bib with descendant book ($b), which has children title ($t) and author ($a).

10 Research Problem
Given an XML twig pattern Q and an XML database D, we need to find ALL the matches of Q on D efficiently.
E.g. consider the following twig pattern and document. Twig pattern: section with branches title and figure. XML tree (figure): elements s1, s2, t1, t2, p1, f1.
Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1).

11 Why research XML twig pattern matching
An XML query includes two parts: value match and twig match.
XPath: paper[title=XML]/author. The predicate title=XML is the value (content) match; the tree paper with children title and author is the twig match.
Twig match: new challenge!

12 Approach Overview
(1) Labeling: assign each element in the XML document tree an integer label to capture the structural information of documents.
(2) Computing: use labels to answer the twig pattern without traversing the original document.

13 Related work
Graph of XML TPQ algorithms, covering labeling schemes and computing algorithms: Containment scheme [SIGMOD01], Stack-merge [ICDE02], Dewey scheme [SIGMOD02], TwigStack [SIGMOD02], XPath-SQL [SIGMOD02], TJFast [VLDB05], Twig2Stack [VLDB06], Dynamic Dewey scheme [SIGMOD09], TreeMatch [TKDE2010].

14 Approach Overview (1) Labeling
Region encoding (also called containment) labeling scheme: (start, end, level).
An example XML tree with region encoding labels: s1 (1,12,1), t1 (2,3,2), s2 (4,11,2), t2 (5,6,3), p1 (7,10,3), f1 (8,9,4).
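The structural tests that region encoding supports can be sketched as follows. This is a minimal illustration, not code from the talk: the names `Region`, `is_ancestor` and `is_parent` are hypothetical, and the element-to-label mapping is the one read off the example tree and the data lists on the slides.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A (start, end, level) region-encoding label."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    """a is a proper ancestor of d iff a's interval strictly contains d's."""
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    """A parent is an ancestor exactly one level above the child."""
    return is_ancestor(p, c) and c.level == p.level + 1


# Labels of the example tree (mapping reconstructed from the slides).
labels = {
    "s1": Region(1, 12, 1),
    "t1": Region(2, 3, 2),
    "s2": Region(4, 11, 2),
    "t2": Region(5, 6, 3),
    "p1": Region(7, 10, 3),
    "f1": Region(8, 9, 4),
}

print(is_ancestor(labels["s1"], labels["f1"]))  # True: s1 contains f1
print(is_parent(labels["s1"], labels["f1"]))    # False: levels 1 and 4
print(is_parent(labels["p1"], labels["f1"]))    # True: levels 3 and 4
```

Both tests need only the two labels, which is why the document tree never has to be traversed.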

15 Approach Overview (1) Labeling
Dewey (also called prefix) labeling scheme: each label is an integer sequence.
Figure: an example XML tree with Dewey labels over the elements s1, s2, t1, t2, p1, f1; the root is labeled ε (the other labels were lost in extraction).

16 Approach Overview (2) Computing
Inverted data list: each data list contains all labels of elements with the same tag name.
Query: s with branches t and f. XML tree: the region-encoded example tree.
Data lists: s: (1,12,1), (4,11,2); t: (2,3,2), (5,6,3); f: (8,9,4).
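As a rough sketch of how the data lists are built and used, the snippet below groups the example labels by tag and then answers the query by brute force. The brute-force join is only a baseline for checking results, not one of the holistic algorithms discussed in the talk; the helper names are hypothetical, and both query edges are assumed to be ancestor-descendant.

```python
from collections import defaultdict
from itertools import product

# (tag, (start, end, level)) for every element of the example tree.
elements = [
    ("s", (1, 12, 1)), ("t", (2, 3, 2)), ("s", (4, 11, 2)),
    ("t", (5, 6, 3)), ("p", (7, 10, 3)), ("f", (8, 9, 4)),
]


def build_lists(elements):
    """One inverted list per tag, labels kept in document (start) order."""
    lists = defaultdict(list)
    for tag, label in sorted(elements, key=lambda e: e[1][0]):
        lists[tag].append(label)
    return dict(lists)


def contains(a, d):
    """Ancestor-descendant test on region labels."""
    return a[0] < d[0] and d[1] < a[1]


def naive_matches(lists):
    """Brute-force matches of the twig s[.//t]//f (assumed A-D edges)."""
    return [
        (s, t, f)
        for s, t, f in product(lists["s"], lists["t"], lists["f"])
        if contains(s, t) and contains(s, f)
    ]


lists = build_lists(elements)
matches = naive_matches(lists)
for m in matches:
    print(m)
```

The three printed triples correspond to the solutions (s1, t1, f1), (s1, t2, f1) and (s2, t2, f1) of the research-problem slide.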

17 Previous work: TwigStack [1] (2) Computing
TwigStack [1] is a holistic algorithm for XML twig matching on the containment labeling scheme.
Two steps in TwigStack: (1) intermediate path solutions are output to match each root-to-leaf path of the query; and (2) these intermediate path solutions are merged to get the final results.
[1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

18 Running example: TwigStack algorithm
Query: s with branches t and f.
Data streams: s: (1,12,1), (4,11,2); t: (2,3,2), (5,6,3); f: (8,9,4).
Output intermediate path solutions:
s//t: ((1,12,1), (2,3,2)); ((1,12,1), (5,6,3)); ((4,11,2), (5,6,3))
s//f: ((1,12,1), (8,9,4)); ((4,11,2), (8,9,4))
Final results: ((1,12,1), (2,3,2), (8,9,4)); ((1,12,1), (5,6,3), (8,9,4)); ((4,11,2), (5,6,3), (8,9,4))
(The animated stack states are not recoverable from the transcript.)
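The second step of the running example, merging path solutions into twig matches, can be sketched as a join on the shared root element. This is an illustrative dictionary-based join under the assumption that both paths branch at the root s; it is not the merge procedure of the TwigStack paper, and `merge_paths` is a hypothetical name.

```python
from collections import defaultdict

# Intermediate path solutions from the running example.
s_t_paths = [((1, 12, 1), (2, 3, 2)), ((1, 12, 1), (5, 6, 3)), ((4, 11, 2), (5, 6, 3))]
s_f_paths = [((1, 12, 1), (8, 9, 4)), ((4, 11, 2), (8, 9, 4))]


def merge_paths(left, right):
    """Join two lists of (root, leaf) paths on their common root element."""
    by_root = defaultdict(list)
    for root, leaf in right:
        by_root[root].append(leaf)
    return [
        (root, left_leaf, right_leaf)
        for root, left_leaf in left
        for right_leaf in by_root[root]
    ]


final = merge_paths(s_t_paths, s_f_paths)
for match in final:
    print(match)
```

Every path solution that finds no partner in this join was wasted work, which is the cost the later algorithms try to avoid.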

19 Limitations of TwigStack
(1) TwigStack may output many useless intermediate results for queries with parent-child relationships.
(2) TwigStack cannot process XML twig queries with ordered predicates, like preceding and following in XPath.
(3) TwigStack cannot answer queries with wildcards in branching nodes. E.g. for a wildcard root * with branches B and C, the parent of B should be an ancestor of C.

20 Outline
Introduction
Holistic algorithms: TwigStackList (CIKM2005), OrderedTJ (DEXA2006), iTwigJoin (SIGMOD2005), TJFast (VLDB2005), Twig2Stack (VLDB2006), TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

21 Inefficiency of TwigStack
TwigStack is inefficient for twig queries with parent-child edges.
More than 99% of the intermediate results are useless: TwigStack wastes too much time outputting useless intermediate results!
Figure (lost): number of intermediate path solutions for Q1 = VP[/DT]//PRP_DOLLAR_, Q2 = S[/JJ]/NP and Q3 = S[//VP/IN]//NP on TreeBank data.

22 Example to illustrate the inefficiency of TwigStack for queries with P-C edges
Twig pattern (figure) over the nodes A, B, C and D. XML tree (figure) with elements A1, E1, D1, B1 … Bn and C1 … Cn.
TwigStack outputs the useless root-to-leaf intermediate path solutions (A1, B1, C1), (A1, B2, C1), …, (A1, Bn, Cn).
The reason for the inefficiency of TwigStack: it assumes that all edges are A-D relationships in the first step and does not consider level information.

23 Naïve improvement is incorrect
Twig pattern and XML tree: as in the previous example (elements A1, E1, D1, B1 … Bn, C1 … Cn).
Naïve improvement: by considering level information, because A1 is not the parent of D1, we do not output the path solutions (A1, B1, C1), (A1, B2, C1), …, (A1, Bn, Cn).
But this naïve approach is NOT correct in some cases!

24 Problem of the naïve approach
The naïve approach may make a wrong decision about whether the current element contributes to final results.
Example: twig pattern (figure) over the nodes A, B, C and D; XML tree (figure) with elements A1, B1, C1 … Cn and D1 … Dm.
When we read A1, B1, C1 and D1, since C1 is not the parent of D1, the naïve approach decides that C1 and D1 do not belong to query answers. But this is wrong!

25 Our solution: Look-ahead
New technique used in our new algorithm, called TwigStackList: look-ahead.
Twig pattern (figure) over the nodes A, B, C and D; XML tree (figure) with elements A1, B1, C1 … Cn and D1 … Dm+1.
When we read A1, B1, C1 and D1, we do not hurriedly decide whether C1 or D1 belongs to final solutions, but buffer C1 to Cn in a main-memory list structure. Since Cn is the parent of D1, we are sure that (A1, B1, Cn, D1) is a real match.
Why not buffer D1 to Dm? Too many!

26 Running example: TwigStackList algorithm
Query (figure) over the nodes A, B, C and D, with root-to-leaf paths A//B and A//C/D.
XML tree (figure) with elements A1, B1, C1, C2, C3, D1, D2.
Data streams: A: (1,11,1); B: (2,2,2); C: (3,10,2), (4,8,3), (5,7,4); D: (6,6,5), (9,9,3).
Stacks SA, SB, SC, SD and list LC, which holds (5,7,4) in the state shown.
Output path solutions:
A//B: ((1,11,1), (2,2,2))
A//C/D: ((1,11,1), (5,7,4), (6,6,5)); ((1,11,1), (3,10,2), (9,9,3))
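The buffering idea behind look-ahead can be illustrated on the C and D labels of this running example. The sketch below is a simplification under stated assumptions, not the TwigStackList algorithm itself: it only shows why comparing D with the head of the C stream alone is misleading, while buffering the nested C elements that contain D reveals the real parent. The function names are hypothetical, and the assignment of labels to the C and D streams is read off the output path solutions.

```python
def ancestors_in_stream(c_stream, d):
    """Buffer every C element whose region contains d.

    The stream is in document order, so scanning can stop at the first
    C element that starts after d.
    """
    buffered = []
    for c in c_stream:
        if c[0] > d[0]:
            break
        if d[1] < c[1]:
            buffered.append(c)
    return buffered


def parent_of(candidates, d):
    """Return the candidate exactly one level above d, or None."""
    return next((c for c in candidates if c[2] == d[2] - 1), None)


c_stream = [(3, 10, 2), (4, 8, 3), (5, 7, 4)]
d1, d2 = (6, 6, 5), (9, 9, 3)

# Naive decision: look only at the head of the C stream.
print(parent_of(c_stream[:1], d1))  # None, so d1 would be discarded

# Look-ahead: buffer the nested C elements first, then decide.
print(parent_of(ancestors_in_stream(c_stream, d1), d1))  # (5, 7, 4)
print(parent_of(ancestors_in_stream(c_stream, d2), d2))  # (3, 10, 2)
```

The buffered list can never be longer than the depth of the document, because all buffered elements lie on one root-to-leaf path.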

27 Features of TwigStackList
Main-memory efficient: the size of the stacks and the list is no more than |Depth(Tree)|, so TwigStackList can process very large documents with a small main-memory cost.
I/O efficient: each element is scanned once.
For a large query class, TwigStackList guarantees that each output path solution is useful to final answers.

28 Optimal query classes
If an algorithm does not output any useless intermediate path solution for a query Q on all given documents, we say that the algorithm is optimal with respect to Q.
If an algorithm has a larger optimal query class, it has a better ability to control the size of intermediate results.

29 Optimal query classes
Optimal class of TwigStack: only A-D in all edges.
Optimal class of TwigStackList: only A-D in branching edges.
(Figure: example twig patterns over the nodes A, B, C and D for each class.)

30 Outline
Introduction
Holistic algorithms: TwigStackList (CIKM2005), OrderedTJ (DEXA2006), iTwigJoin (SIGMOD2005), TJFast (VLDB2005), Twig2Stack (VLDB2006), TreeMatch (TKDE2010)
Benchmark experiments
Conclusions

31 Motivation
TwigStack and TwigStackList cannot handle order-based twig queries.
XPath and XQuery include ordered axes such as following, preceding, following-sibling and preceding-sibling.
XPath expression: A/B[following-sibling::C]. Twig (figure): A with children B and C; the symbol < between B and C shows that B and C are ordered.

32 Ordered twig query pattern
Ordered XML twig pattern: sibling query nodes should be matched according to their order in the twig query.
Example (figure): an ordered query over the nodes A, B, C and D, and a data tree with elements A1, B1, C1, D1, D2, D3.
Only D2 and D3 contribute to final results.

33 OrderedTJ
OrderedTJ is a new algorithm proposed for evaluating ordered twig query patterns.
OrderedTJ extends TwigStackList and also uses the stack and list data structures.
What is the main modification of OrderedTJ over TwigStackList? OrderedTJ additionally checks the order conditions of elements before outputting intermediate paths.

34 OrderedTJ
Before any element is pushed onto the stack, OrderedTJ checks the order condition.
Query (figure): ordered twig over the nodes A, B, C and D, with root-to-leaf paths A/B/C and A//D. Data (figure): elements A1, B1, C1, D1, D2, D3.
Data streams: A: (1,9,1); B: (3,5,2); C: (4,4,3); D: (2,2,2), (6,8,2), (7,7,3).
Stacks: SA, SB, SC, SD.
Output intermediate path solutions:
A/B/C: ((1,9,1), (3,5,2), (4,4,3))
A//D: ((1,9,1), (6,8,2)); ((1,9,1), (7,7,3))
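The order check on this example can be sketched with region labels alone. The snippet assumes that the order constraint of the query requires a D element to come after the B element in document order, and that the streams are assigned as read off the output path solutions; `follows` and `contains` are hypothetical helper names, and the filter is only the order test, not the OrderedTJ algorithm.

```python
def contains(a, d):
    """Ancestor-descendant test on (start, end, level) labels."""
    return a[0] < d[0] and d[1] < a[1]


def follows(later, earlier):
    """True if `later` starts after `earlier` has ended (no overlap)."""
    return later[0] > earlier[1]


a = (1, 9, 1)
b_stream = [(3, 5, 2)]
d_stream = [(2, 2, 2), (6, 8, 2), (7, 7, 3)]

# Keep a D element only if it lies under A and comes after some B under A.
ordered_d = [
    d for d in d_stream
    if contains(a, d) and any(contains(a, b) and follows(d, b) for b in b_stream)
]
print(ordered_d)  # [(6, 8, 2), (7, 7, 3)]
```

The element (2,2,2) is rejected because it precedes B, which agrees with the statement that only D2 and D3 contribute to final results.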

35 The optimal query classes of OrderedTJ
OrderedTJ guarantees optimality for ordered queries with A-D relationships from the second branching edge onwards. In other words, OrderedTJ remains optimal for queries with a P-C relationship in the first branching edge.
Figure: Q1 is a twig A with branches B and C; Q2 is the ordered twig A with branches B and C (B < C). TwigStackList is non-optimal for Q1; OrderedTJ is optimal for Q2.

36 Outline
Introduction
Holistic algorithms: TwigStackList (CIKM2005), OrderedTJ (DEXA2006), iTwigJoin (SIGMOD2005), TJFast (VLDB2005), Twig2Stack (VLDB2006), TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

37 iTwigJoin algorithm
TwigStack and OrderedTJ partition data into streams according to tag names alone.
We propose two new data partition schemes: (1) the tag+level scheme and (2) the prefix path scheme.
Potential benefits: enlarge the optimal query classes; reduce I/O cost.

38 Data partition scheme
Example tree (figure): elements A1, B1, C1, C2, C3.
Tag partition: TA: A1; TB: B1; TC: C1, C2, C3.
Tag+level partition (refined by level): T1A: A1; T2B: B1; T2C: C2; T3C: C1, C3.
Prefix path partition (refined by path): TA: A1; TAB: B1; TAC: C2; TABC: C1; TACC: C3.
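The three partition schemes can be reproduced with a few lines of code. The tree shape used below (A1 has children B1 and C2, B1 has child C1, C2 has child C3) is inferred from the prefix path partition on the slide, and the function and variable names are hypothetical.

```python
from collections import defaultdict

# (element, tag, parent element), listed in document order.
tree = [
    ("A1", "A", None),
    ("B1", "B", "A1"),
    ("C1", "C", "B1"),
    ("C2", "C", "A1"),
    ("C3", "C", "C2"),
]


def partition(tree):
    """Return the tag, tag+level and prefix-path partitions of the tree."""
    level, path = {}, {}
    by_tag = defaultdict(list)
    by_tag_level = defaultdict(list)
    by_prefix = defaultdict(list)
    for name, tag, parent in tree:  # parents appear before their children
        level[name] = 1 if parent is None else level[parent] + 1
        path[name] = tag if parent is None else path[parent] + tag
        by_tag[tag].append(name)
        by_tag_level[(tag, level[name])].append(name)
        by_prefix[path[name]].append(name)
    return dict(by_tag), dict(by_tag_level), dict(by_prefix)


by_tag, by_tag_level, by_prefix = partition(tree)
print(by_tag)
print(by_tag_level)
print(by_prefix)
print(len(by_tag), len(by_tag_level), len(by_prefix))  # 3 4 5
```

The growing number of streams (3, 4, 5) is the price paid for finer partitions; in exchange, a query can skip whole streams whose level or path cannot match.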

39 Properties of the three schemes
From the tag scheme to the tag+level scheme (refined by level) to the prefix path scheme (refined by path):
1. the number of inverted lists increases (CPU cost increases correspondingly);
2. the optimal query classes enlarge (output cost decreases correspondingly);
3. the number of elements scanned decreases (input cost decreases correspondingly).

40 The number of inverted lists: increasing
Example tree (figure): elements A1, B1, C1, C2, C3.
Tag partition (3 lists): TA: A1; TB: B1; TC: C1, C2, C3.
Tag+level partition (4 lists): T1A: A1; T2B: B1; T2C: C2; T3C: C1, C3.
Prefix path partition (5 lists): TA: A1; TAB: B1; TAC: C2; TABC: C1; TACC: C3.

41 The optimal query classes: enlarging
Optimal class of the tag scheme: only A-D in branching edges.
Optimal class of the tag+level scheme: only A-D in branching edges, and only P-C in all edges.
Optimal class of the prefix path scheme: only A-D in branching edges, and only P-C in all edges, and only 1-branching.
(Figure: example twig patterns over the nodes A, B, C, D and E for each class.)

42 The number of elements scanned: decreasing
Query (figure): twig over the nodes A, B and C. Data (figure): elements D1, A1, B1, C1, C2 on levels 1 to 3.
Tag scheme: TA: A1; TB: B1; TC: C1, C2.
Tag+level scheme: T2A: A1; T3B: B1; T2C: C1; T3C: C2.
Prefix path scheme: TDA: A1; TDAB: B1; TDC: C1; TDCC: C2.

43 iTwigJoin algorithm
A general algorithm which can be applied to all three schemes. For different schemes, iTwigJoin achieves different performance.
The main technical difficulty in designing iTwigJoin is handling many current nodes for one tag name.
We classify the currently visited elements into three categories: current-match, current-useless and current-blocked.

44 Three kinds of elements
Current-match: the element is guaranteed to contribute to final answers with current elements.
Current-useless: the element is guaranteed not to contribute to final answers with current and remaining elements.
Current-blocked: the element is neither current-match nor current-useless.
State diagram: a current-blocked element becomes a match when matching data appears, and becomes useless when it cannot get any matching data.

45 Example on three kinds of elements
Query (figure): twig over the nodes A, B and C. Document (figure): elements A1, A2, A3, B1, B2, C1, C2 on levels 1 to 3, partitioned by the tag+level scheme into the streams T1A, T2A, T2B, T3B, T2C and T3C.
Current-match: A1, B1, C2. Current-blocked: B2, C1. Current-useless: A2.

46 Example on three kinds of elements
Same query, document and tag+level streams as on the previous slide.
B2 and C1 are converted from current-blocked to current-match due to the appearance of A3.

47 Main flowchart of iTwigJoin
Decisions (each with Y/N branches): Is there any current-useless element? Is there any current-match element? Are all elements scanned?
Actions:
See whether it contributes to a previous match, and advance to the next element.
Output intermediate path solutions, and advance to the next element.
Choose the smallest current-blocked element and output intermediate path solutions, then advance to the next element.
When all elements are scanned, the algorithm ends.
(The arrows connecting decisions and actions are not recoverable from the transcript.)

48 Outline
Introduction
Holistic algorithms: TwigStackList (CIKM2005), OrderedTJ (DEXA2006), iTwigJoin (SIGMOD2005), TJFast (VLDB2005), Twig2Stack (VLDB2006), TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

49 Motivation: new labeling scheme
TwigStackList, OrderedTJ and iTwigJoin are all based on the containment labeling scheme.
Why not try the Dewey labeling scheme for XML twig pattern queries? Oh, it is really a novel idea!

50 Original Dewey Labeling Scheme
In the Dewey labeling scheme, each element is represented by an integer sequence: (i) the root is labeled by the empty string ε; (ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
Example (figure): a tree with elements s1, s2, f1, f2, t1, t2 and root label ε (the other labels were lost in extraction).
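The two labeling rules translate directly into a short recursive function. Because the labels of the slide's example were lost, the tree below is an assumed shape chosen only for illustration; labels are modeled as tuples, with the empty tuple standing for ε, and the function names are hypothetical.

```python
def dewey_labels(tree, label=()):
    """Label a (name, children) tree: the x-th child of s gets label(s).x."""
    name, children = tree
    labels = {name: label}
    for x, child in enumerate(children, start=1):
        labels.update(dewey_labels(child, label + (x,)))
    return labels


def is_ancestor(a, d):
    """a is an ancestor of d iff a's label is a proper prefix of d's."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p, c):
    """p is the parent of c iff dropping c's last component gives p."""
    return len(c) == len(p) + 1 and c[:-1] == p


# Assumed example tree (not the one from the slide).
tree = ("s1", [("t1", []), ("s2", [("t2", []), ("p1", [("f1", [])])])])
labels = dewey_labels(tree)

print(labels["s1"])  # () is the root label ε
print(labels["f1"])  # (2, 2, 1), i.e. 2.2.1
print(is_ancestor(labels["s2"], labels["f1"]))  # True
```

A Dewey label already contains the labels of all ancestors as its prefixes, but not their tag names; that gap is what the extended Dewey scheme addresses.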

51 Main problem of the original Dewey
If we use the original Dewey labeling scheme to answer a twig query, we need to read the labels for all query nodes. Thus, this is not a better solution than previous algorithms.
Our idea: extend the original Dewey labeling scheme so that, given the label of any element e, we can know the path of e from this label alone.

52 Modular function
We need to know some schema information: a DTD (Document Type Definition) or an XML schema.
Given DTD information: book -> author, title, chapter*
Our solution: using a modular function, we create a mapping between an element tag and an integer number. We define X_author mod 3 = 0, X_title mod 3 = 1 and X_chapter mod 3 = 2, where X_t is the last integer of the label of an element with tag t, and 3 is the number of distinct tags under book.
Example (figure): book ε with children author 0, title 1, chapter 2 and chapter 5. Why is the second chapter labeled 5 and not 3 as in the original Dewey?

53 Derive element tag
From a label, we can derive its tag name.
DTD: book -> author, title, chapter*
Recall that we define X_author mod 3 = 0, X_title mod 3 = 1 and X_chapter mod 3 = 2.
Example (figure): for the children of book ε with labels 0, 1, 2 and 5, the tags can be derived as author, title, chapter and chapter.

54 More examples for assigning labels
Let us consider a more complicated DTD: a -> (b | c)*, d?, c+
We define X_b mod 3 = 0, X_c mod 3 = 1 and X_d mod 3 = 2. (Why do we use mod 3 instead of 4?)
Example (figure): a ε with children b 0, d 2, c 4 and c 7.
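One way to reproduce the label components in both examples is sketched below. The rule used here, picking the smallest integer larger than the previous sibling's component that has the required remainder, is an assumption inferred from the examples on the slides (it yields 0, 1, 2, 5 and 0, 2, 4, 7); `assign_components` is a hypothetical name.

```python
def assign_components(child_tags, tag_order):
    """Assign the last label component to each child, left to right.

    tag_order lists the distinct child tags of the parent; a child with
    tag tag_order[k] gets a component x with x mod len(tag_order) == k.
    """
    n = len(tag_order)
    index = {tag: k for k, tag in enumerate(tag_order)}
    components = []
    previous = -1
    for tag in child_tags:
        x = previous + 1
        while x % n != index[tag]:
            x += 1
        components.append(x)
        previous = x
    return components


book = assign_components(
    ["author", "title", "chapter", "chapter"], ["author", "title", "chapter"]
)
print(book)  # [0, 1, 2, 5]

a = assign_components(["b", "d", "c", "c"], ["b", "c", "d"])
print(a)  # [0, 2, 4, 7]
```

Components stay strictly increasing among siblings, so document order is preserved, while the remainder alone is enough to recover each child's tag.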

55 Derive the path from a label
By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label. For example:
DTD: book -> author, title, chapter*; chapter -> (paragraph | section)*; section -> (paragraph | section)*
FST (figure): from book, transitions to author (mod 3 = 0), title (mod 3 = 1) and chapter (mod 3 = 2); from chapter and from section, transitions to paragraph and section labeled mod 2 = 0 and mod 2 = 1.
Document (figure): a book with author, title and chapter children, and section and paragraph elements below the chapter.
Question: given a label 5.1.0, what is the corresponding path?

56 Derive the path from a label
By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label. For example:
DTD: book -> author, title, chapter*; chapter -> (paragraph | section)*; section -> (paragraph | section)*
FST (figure): from book, transitions to author (mod 3 = 0), title (mod 3 = 1) and chapter (mod 3 = 2); from chapter and from section, transitions to paragraph and section labeled mod 2 = 0 and mod 2 = 1.
Following the red path in the FST, we find that the label 5.1.0 denotes the path book/chapter/section/paragraph.
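The path derivation can be sketched as a table-driven walk over the label. The table below encodes the DTD of this example; the remainders for paragraph (0) and section (1) are inferred from the stated answer for label 5.1.0, and `CHILD_TAGS` and `derive_path` are hypothetical names rather than part of any published implementation.

```python
# For each tag, its distinct child tags; position k means "component mod n == k".
CHILD_TAGS = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def derive_path(label, root="book"):
    """Derive the tag path of an element from its extended Dewey label."""
    path = [root]
    for component in label:
        tags = CHILD_TAGS[path[-1]]  # KeyError if the current tag has no children
        path.append(tags[component % len(tags)])
    return "/".join(path)


print(derive_path((5, 1, 0)))  # book/chapter/section/paragraph
print(derive_path((0,)))       # book/author
print(derive_path(()))         # book
```

Since every prefix of the label is the label of an ancestor, the same walk yields both the labels and the tag names of all ancestors.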

57 Two properties of extended Dewey
Find ancestor label: from the label of any element, we can derive the labels of all its ancestors.
Find ancestor name: from the label of any element, we can derive the tag names of all its ancestors.
These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.

58 A new algorithm: TJFast
For each node n in the query, there exists a corresponding input stream Tn. Tn contains the extended Dewey labels of the elements with tag n. Those labels are arranged in document order.
For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that possibly contribute to query answers. (Compared to TwigStackList, what is the difference?)
At any point of the computation, the size of the set Sb is bounded by the depth of the XML document.

59 An example for TJFast algorithm
DTD: a -> a*, d*, b*; b -> d*, c*; d -> c*
Document (figure): elements a1, a2, a3, b1, b2, c1, c2, d1, d2, d3 below Root, with extended Dewey labels (mostly lost in extraction).
Query (figure): twig over the nodes A, B, C and D with root-to-leaf paths A//D and A/B//C.
Streams: TD and TC. Why are there only two streams?
A set for the branching node A: { }

60 An example for TJFast algorithm
Same document, query and streams TD and TC as on the previous slide; set for A: { }
From the head labels of TD and TC, the finite state transducer of the extended Dewey labeling scheme derives the paths a1/a2/d1 and a1/a3/b1/c1.

61 An example for TJFast algorithm
Same document, query and streams TD and TC; set for A: { }
Both a1 and a3 are possibly involved in query answers. (Why not a2?)

62 An example for TJFast algorithm
Same document, query and streams TD and TC.
Then we insert a1 and a3 into the set: {a1, a3}
Output path solutions: A//D: (a1, d1); A/B//C: (a3, b1, c1)

63 An example for TJFast algorithm
Same document, query and streams TD and TC; set for A: {a1, a3}
Move the cursor of TD from d1 to d2.
Output path solutions: A//D: (a1, d1), (a1, d2), (a3, d2); A/B//C: (a3, b1, c1)

64 An example for TJFast algorithm
Same document, query and streams TD and TC; set for A: {a1, a3}
Move the cursor of stream TD from d2 to d3.
Output path solutions: A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3); A/B//C: (a3, b1, c1)

65 An example for TJFast algorithm
Same document, query and streams TD and TC; set for A: {a1, a3}
Move the cursor of stream TC from c1 to c2.
Output path solutions: A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3); A/B//C: (a3, b1, c1), (a1, b2, c2)

66 Sort and merge-join in TJFast
Same document and query as in the running example.
Phase 1: intermediate paths for A//D and A/B//C.
Phase 2: the intermediate paths are sorted and merge-joined into final solutions.
(The listed paths and solutions were lost in extraction.)

67 TJFast+L
By applying the extended Dewey labeling scheme to the tag+level streaming scheme, we obtain the TJFast+L algorithm, an extension of TJFast.
Two benefits of TJFast+L over TJFast: it reduces I/O cost by reading fewer elements, and it enlarges the optimal query classes.
Q: Why not apply extended Dewey to the prefix-path scheme? Because by the finite state transducer, we can already know the path information…

68 Optimal query classes
Optimal class of TJFast: only A-D in branching edges.
Optimal class of TJFast+L: additionally, only P-C in all edges.
(Figure: example twig patterns over the nodes A, B, C and D for each class.)

69 Outline
Introduction
Holistic algorithms: TwigStackList (CIKM2005), OrderedTJ (DEXA2006), iTwigJoin (SIGMOD2005), TJFast (VLDB2005), Twig2Stack (VLDB2006), TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

70 State-of-the-art: XML Query Processing
Holistic approaches by query type:
Path: PathStack [Bruno et al.]
Tree: TwigStack [Bruno et al.]
Generalized Tree Pattern (GTP): ? Twig2Stack

71 Processing Generalized Tree Pattern (GTP) Queries
XQuery: FOR $b in //A[E]//B, $d in $b/D LET $c := $b/C RETURN $b, $c, $d
Figure: the GTP over the nodes A, B, C, D and E. Legend: mandatory axis, optional axis, return node, group return node.

72 Motivation
PathStack [Bruno et al.]. Query: //A//B. Data (figure): elements a1, a2, b1, b2 held in the stacks S[A] and S[B].
Key observation: PathStack minimizes intermediate results through a compact representation of path matches:
Inter-node: record the A-D relationship between elements in different query nodes, e.g., b1 and a2, b2 and a2.
Intra-node: record the A-D relationship between elements within the same query node, e.g., b1 and b2.
TwigStack [Bruno et al.] minimizes intermediate results by outputting only those path matches that are in final twig results. However, such optimality cannot be guaranteed [Choi et al.], and it is not helpful for processing GTP queries.
Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?

73 Hierarchical Stack Encoding
Inter-node (//A//B): we can still use explicit edges.
Intra-node (A): matching elements form a tree structure as well.
Associate each query node with a hierarchical stack. Push element e into the hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E.
Matching can be determined when the entire sub-tree of e has been seen; this requires a post-order document traversal.
Figure: elements a1, a2, a3, a4 and the hierarchical stack HS[A] holding them as a tree.

74 Twig2Stack: Running Example
Query (figure): twig over the nodes A, B, C and D with root-to-leaf paths //A/B//D and //A/B//C.
Document (figure): elements a1, a2, b1, b2, b3, c1, c2, d1, d2, d3 with the region labels [1,20], 1; [2,15], 2; [3,14], 3; [4,11], 4; [5,10], 5; [6,7], 6; [8,9], 6; [12,13], 4; [16,19], 2; [17,18], 3 (the element-to-label mapping is not recoverable from the transcript).
Hierarchical stacks (figure): HS[A], HS[B], HS[C], HS[D], built by merging stacks.
TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together. Twig2Stack requires neither path joins nor path enumeration!

75 Not yet done: Memory Usage
Hierarchical stack encoding could hold the entire document in memory in the worst case.
Unlike the DOM approach, only matches need to be stored: tag match, (partial) twig match, predicate evaluation.
Early result enumeration dramatically reduces the memory usage: enumerate query results before the end of the document and release the buffer.
Main idea: a hybrid of the top-down (PathStack) and bottom-up (Twig2Stack) approaches.

76 Outline
Introduction
Holistic algorithms: TwigStackList (CIKM2005), OrderedTJ (DEXA2006), iTwigJoin (SIGMOD2005), TJFast (VLDB2005), Twig2Stack (VLDB2006), TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

77 TreeMatch (TKDE 2010)
Twig pattern: A with branches B and C. XML tree (figure): elements A1, A2, B1, B2, C1, C2.
Matching cross (figure): between B1, B2 and C1, C2.
It is the real reason for sub-optimality!

78 Bounded and Unbounded Matching Cross
Twig pattern: A with branches B and C. XML tree (figure): elements A1 … An, B1 … B2n, C1 … C2n.
Unbounded matching cross (figure): between B1 … B2n and C1 … C2n.
Bounded matching cross (figure): between A1 … An and C1 … C2n.

79 BMC and UMC
Bounded Matching Cross (BMC): optimal class. Only a limited number of nodes is stored in main memory.
Unbounded Matching Cross (UMC): sub-optimal class, but not entirely. We cannot guarantee to store only a limited number of nodes in main memory, but a sub-class of UMC is still optimal.

80 Unbounded Matching Cross with Mediator
Twig pattern: A with branches B and C (output: node C). XML tree (figure): elements A1 … An, B1 … B2n, C1 … Cn.
Unbounded matching cross (figure): between the B elements and C1 … Cn.
Node A is a mediator node, and we do not need to store all Bi in main memory!

81 Optimal query classes
Optimal class of TwigStack: only A-D in all edges.
Optimal class of TwigStackList: only A-D in branching edges.
Optimal class of TreeMatch: only A-D in non-output branching edges.
(Figure: example twig patterns over the nodes A, B, C and D for each class.)

82 Outline
Introduction
Holistic algorithms: TwigStackList (CIKM2005), OrderedTJ (DEXA2006), iTwigJoin (SIGMOD2005), TJFast (VLDB2005), Twig2Stack (VLDB2006), TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work

83 Experiment Setup
Implementation (seven algorithms): TwigStack (SIGMOD2002), TwigStackList (CIKM2005), OrderedTJ (DEXA2006), iTwigJoin (SIGMOD2005), TJFast (VLDB2005), Twig2Stack (VLDB2006), TreeMatch (TKDE2010).
Datasets: XMark, DBLP, TreeBank.
Metrics: query processing time, I/O time.

84 Experiments. Benchmarks: XMark (synthetic data), DBLP (real data from the DBLP database), Treebank (real data from the Wall Street Journal). Dataset statistics, in the order XMark / DBLP / Treebank: Data size (MB): [values not shown]; Nodes (million): [values not shown]; Max/Avg depth: 12/5, 6/2.9, 36/7.8.
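The Max/Avg depth row can be reproduced for any XML document with a single traversal. The sketch below assumes that only element nodes are counted and that the root has depth 1; the slide does not state which convention the authors used, and the function name `depth_stats` is hypothetical.

```python
import xml.etree.ElementTree as ET


def depth_stats(xml_text: str):
    """Return (element count, maximum depth, average depth) for an XML document.

    Assumptions: only element nodes are counted and the root has depth 1.
    """
    root = ET.fromstring(xml_text)
    depths = []
    stack = [(root, 1)]
    while stack:
        elem, depth = stack.pop()
        depths.append(depth)
        # Push all child elements with their depth.
        stack.extend((child, depth + 1) for child in elem)
    return len(depths), max(depths), sum(depths) / len(depths)


# Small illustrative document: a(1) -> b(2) -> c(3), a(1) -> d(2).
SAMPLE = "<a><b><c/></b><d/></a>"
count, max_depth, avg_depth = depth_stats(SAMPLE)
```

Deep, recursive data such as Treebank would show a large gap between the two numbers, while a shallow document such as DBLP would keep both small.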

85 Tested queries. Some tested queries (query, source, twig query):
Q1 (DBLP): //proceedings//title[.//i]//sup
Q2 (DBLP): //article[.//sup]//title//sub
Q3 (Treebank): /S[.//VP/IN]//NP
Q4 (Treebank): /S/VP/PP[IN]/NP/VBN
Q5 (Treebank): //VP[DT]//PRP_DOLLAR_

86 Tested queries (Cont.) Q1, Q2, and Q3 are based on XMark data; Q4, Q5, and Q6 are on TreeBank data.

87 TwigStackList vs. TwigStack. Experiment data: TreeBank. Queries: Q1 = VP[/DT]//PRP DOLLAR, Q2 = S[/JJ]/NP, Q3 = S[//VP/IN]//NP. Compared to TwigStack, TwigStackList significantly reduces the number of useless output elements. [Chart: number of intermediate paths.]

88 TwigStackList vs. OrderedTJ. STW: Straightforward-TwigStack. STWL: Straightforward-TwigStackList. OrderedTJ is significantly better than the two straightforward methods on XMark and TreeBank data.

89 iTwigJoin. Decrease in the number of elements scanned: more refined streaming schemes scan fewer elements to answer a query.

90 iTwigJoin. Performance of queries for three streaming schemes: the prefix-path scheme is suitable for large but shallow documents, and the tag+level scheme generally works well even for complicated recursive documents.
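To see why a more refined streaming scheme scans fewer elements, the elements of a document can be partitioned into streams in three ways and the relevant stream sizes compared. This is an illustrative sketch only, not iTwigJoin itself: it assumes the third scheme is plain tag streaming, it only counts the elements in the stream for the last step of an absolute child-only path, and the document `DOC` and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny made-up document with <title> elements at different levels and paths.
DOC = """<bib>
  <book><title>t1</title><section><title>t2</title></section></book>
  <article><title>t3</title></article>
  <title>t4</title>
</bib>"""


def build_streams(xml_text):
    """Partition all elements by tag, by (tag, level), and by root-to-node prefix path."""
    root = ET.fromstring(xml_text)
    by_tag = defaultdict(list)
    by_tag_level = defaultdict(list)
    by_prefix = defaultdict(list)

    def walk(elem, path):
        path = path + (elem.tag,)
        by_tag[elem.tag].append(elem)
        by_tag_level[(elem.tag, len(path))].append(elem)  # root is level 1
        by_prefix[path].append(elem)
        for child in elem:
            walk(child, path)

    walk(root, ())
    return by_tag, by_tag_level, by_prefix


def scanned_for_path(xml_text, steps):
    """Count the elements read for the last step of an absolute child-only path.

    `steps` is a tuple of tags such as ("bib", "book", "title"), i.e. /bib/book/title.
    """
    by_tag, by_tag_level, by_prefix = build_streams(xml_text)
    tag = steps[-1]
    return {
        "tag": len(by_tag[tag]),
        "tag+level": len(by_tag_level[(tag, len(steps))]),
        "prefix-path": len(by_prefix[tuple(steps)]),
    }


counts = scanned_for_path(DOC, ("bib", "book", "title"))
```

For /bib/book/title the tag stream holds all four title elements, the tag+level stream holds the two titles at level 3, and the prefix-path stream holds only the one title that can actually match.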

91 TwigStackList vs. iTwigJoin. Observation: iTwigJoin scans far fewer elements than TwigStack and TwigStackList in two twig queries. [Chart: TreeBank data.]

92 TwigStackList vs. iTwigJoin. Observation: iTwigJoin performs much better than TwigStack/TwigStackList. Explanation: iTwigJoin reduces I/O cost by reading fewer elements. [Charts: TreeBank data, DBLP data.]

93 iTwigJoin, TJFast, Twig2Stack. Observation: iTwigJoin and TJFast perform better than Twig2Stack. Reason: iTwigJoin and TJFast reduce I/O cost by reading fewer elements. [Charts: TreeBank data, DBLP data.]

94 Experiments: TJFastL and iTwigJoin. Observation: both algorithms are based on the tag+level scheme, and TJFastL performs much better than iTwigJoin under that scheme. Explanation: TJFast reduces I/O cost by reading fewer elements. [Charts: DBLP data, TreeBank data.]

95 TJFast and TreeMatch. Observation: TreeMatch performs much better than TJFast. Explanation: TreeMatch reduces I/O cost relative to TJFast. [Charts: DBLP data, TreeBank data.]

96 Conclusions. Efficient processing of twig queries is a core operation in XPath and XQuery. We reviewed and compared seven holistic algorithms: TwigStack (SIGMOD 2002), TwigStackList (CIKM 2005), OrderedTJ (DEXA 2005), iTwigJoin (SIGMOD 2005), TJFast (VLDB 2005), Twig2Stack (VLDB 2006), TreeMatch (TKDE 2010). Comprehensive benchmark experiments show the correctness and efficiency of holistic algorithms.

97 Conclusions (Cont.) In holistic TPQ processing, I/O cost takes most of the time. TJFast reduces the input data size; Twig2Stack reduces the output size; TreeMatch reduces both input and output data size.

98 Reference works
[1] J. Lu, T. W. Ling, Z. Bao, and C. Wang. Extended XML tree pattern matching: theories and algorithms. IEEE TKDE, 2010 (to appear). Proposes the TreeMatch algorithm.
[2] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM. Proposes the TwigStackList algorithm.
[3] J. Lu and T. W. Ling. Labeling and querying dynamic XML trees. In Proceedings of the Sixth Asia Pacific Web Conference, 2004, pp. 180–189. Proposes a new labeling scheme for dynamic XML documents.
[4] T. Chen, J. Lu, and T. W. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD. Proposes two new data streaming techniques.
[5] J. Lu, T. W. Ling, C. Chan, and T. Chen. From region encoding to extended Dewey: on efficient processing of XML twig pattern matching. In Proceedings of VLDB, 2005, pp. 193–204. Proposes the TJFast algorithm.

99 Reference works (Cont.)
[6] J. Lu, T. W. Ling, T. Yu, C. Li, and W. Ni. Efficient processing of ordered XML twig pattern matching. In Proceedings of DEXA, 2005, pp. 300–309. Proposes the OrderedTJ algorithm.
[7] J. Lu, T. W. Ling, and T. Chen. TJFast: effective processing of XML twig pattern matching. In Proceedings of WWW, 2005, pp. 1118–1119. Proposes the extended Dewey labeling scheme.
[8] T. Yu, T. W. Ling, and J. Lu. TwigStackListNot: a holistic twig join algorithm for twig query with not-predicates on XML data. In DASFAA. Proposes an algorithm for twig queries with NOT predicates.
[9] J. Lu, R. Yang, W. Ling, and A. K. H. Tung. Efficient XML tree pattern matching: theory and algorithm. Submitted to IEEE TKDE. Proposes a theory and algorithm for extended XML tree patterns.

100 Reference works (Cont.)
[10] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural joins: a primitive for efficient XML query pattern matching. In ICDE. Proposes the StackTree algorithm.
[11] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD. Proposes the TwigStack algorithm.
[12] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On supporting containment queries in relational database management systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2001, pp. 425–436. Proposes the containment labeling scheme.

101 Reference works (Cont.)
[13] H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed XML documents. In VLDB, 2003. Proposes the TSGeneric algorithm.
[14] I. Tatarinov, S. Viglas, K. S. Beyer, J. Shanmugasundaram, E. J. Shekita, and C. Zhang. Storing and querying ordered XML using a relational database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002, pp. 204–215. Proposes the Dewey labeling scheme.
[15] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In SIGMOD, 2003. Proposes the ViST system.
[16] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, and K. S. Beyer. Virtual cursors for XML joins. In CIKM. Proposes the virtual cursor algorithm.

