Jiaheng Lu, Ting Chen and Tok Wang Ling, National University of Singapore
Finding all the occurrences of a twig pattern in an XML database is a core operation for efficient evaluation of XML queries.

Our motivation is twofold: (1) The performance of previous holistic twig join algorithms can be further improved. (2) Algorithms based on region encoding cannot answer queries with wildcards in branching nodes. For example, according to the region codes, which document, Doc1 or Doc2, matches the query? By reading the region encodings of elements a, b and c alone, we cannot answer this wildcard branching query.

Extended Dewey solves two problems: wildcard queries and query performance.

References:
(1) N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In SIGMOD Conference.
(2) J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533–542, 2004.
(3) P. O'Neil et al. ORDPATHs: insert-friendly XML node labels. In SIGMOD, pages 903–908.
(4) I. Tatarinov et al. Storing and querying ordered XML using a relational database system. In Proc. of SIGMOD, pages 204–215.

Given an extended Dewey label, we can use the finite state transducer to derive the path of the labeled element. For example, the derived paths include bib/book/chapter/section and bib/book/chapter/title.

Experimental setting:
(1) We use random data sets (with 3 million nodes) consisting of seven labels, namely a, b, ..., e. The node labels in the data are uniformly distributed.
(2) We issue four twig queries: a[.//b]//c, a[./b]/c, a[./b/c]/d/e and a[.//b/c]//d/e.
(3) We compare our method with the previous algorithms TwigStack and TwigStackList.

To answer a twig pattern query, we propose a new holistic twig join algorithm called TJFast. Unlike previous algorithms, TJFast needs to access only the labels of the query's leaf nodes to answer path and twig queries, so it significantly reduces I/O cost.
For example, given the path query //chapter/section/text, we access only the labels of text elements to answer the query. Given the twig query //chapter/section[.//keyword]/text, we scan only keyword and text.

TJFast: Effective Processing of XML Twig Pattern Matching

[1. INTRODUCTION] [2. Our new labeling scheme: EXTENDED DEWEY] [3. A new holistic algorithm: TJFAST] [4. Preliminary experiments]

Tatarinov et al. proposed the Dewey labeling scheme, which can be used to answer this wildcard query (see Fig 2). Since b and c do not share the same parent in Doc1, only Doc2 matches the wildcard query. However, a twig join algorithm based on the Dewey scheme is not as efficient as one based on region encoding, because prefix comparison is more time-consuming than the integer comparison used in region encoding. In this paper, we extend the Dewey labeling scheme so that it not only answers wildcard queries but also performs better than algorithms based on region encoding.

Figure 1. An example to illustrate the limitation of region encoding
Figure 2. An example to answer wildcards query with Dewey scheme
Figure 3. An example to answer wildcards query with Dewey scheme
Figure 4. DTD for the XML tree in Fig 3.

Labeling method: Given a document and its DTD, we use a modulo function to map an integer to a particular tag name. For example, for the DTD rule book → author, title, chapter, let x(t) denote the last integer of the label of an element with tag t; then x(author) mod 3 = 1, x(title) mod 3 = 2 and x(chapter) mod 3 = 0. The label of any text value ends with 0.

Figure 5. A finite state transducer for the DTD in Fig 4.

TJFast only needs to access the labels of leaf nodes to answer a query.

Results analysis: TJFast outperforms TwigStack and TwigStackList under all settings. The improvement is due to the fact that TJFast scans labels only for the query's leaf nodes. Algorithms based on region encoding are comparable to TJFast only when the number of elements for internal query nodes is very small.
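The Dewey-based wildcard check described above (b and c must share the same parent) might be sketched as follows. This is only an illustration, assuming plain Dewey labels written as dot-separated integers. The helper names and the sample labels are hypothetical and are not taken from Fig 2.

```python
def parse_dewey(label):
    """Turn a dot-separated Dewey label such as '1.2.1' into a tuple of ints."""
    return tuple(int(part) for part in label.split("."))


def parent(label):
    """The parent's Dewey label is the child's label without its last component."""
    return label[:-1]


def is_ancestor(anc, desc):
    """anc is a proper ancestor of desc exactly when anc is a proper prefix of desc."""
    return len(anc) < len(desc) and desc[:len(anc)] == anc


def share_parent(x, y):
    """Two distinct nodes are siblings when their parent labels are equal."""
    return x != y and parent(x) == parent(y)


# Hypothetical labels: in the first document b and c hang under different
# children of a; in the second they hang under the same child.
a = parse_dewey("1")
doc1_b, doc1_c = parse_dewey("1.1.1"), parse_dewey("1.2.1")
doc2_b, doc2_c = parse_dewey("1.1.1"), parse_dewey("1.1.2")

doc1_matches = is_ancestor(a, doc1_b) and share_parent(doc1_b, doc1_c)
doc2_matches = is_ancestor(a, doc2_b) and share_parent(doc2_b, doc2_c)
```

The check relies on prefix comparison of whole labels, which is the cost the text contrasts with the integer comparisons of region encoding.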
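The modulo-based labeling method and the path derivation it enables can be sketched in a few lines. This is a simplified stand-in for the finite state transducer, not the authors' implementation. Only the mapping for book (author → 1, title → 2, chapter → 0) comes from the text; the child tables for bib and chapter, the convention that the root has an empty label, and the sample labels are assumptions made for illustration. Text values are not handled.

```python
# For each parent tag, position i holds the child tag whose label component x
# satisfies x mod n == i, where n is the number of distinct child tags.
CHILD_TAGS = {
    "bib": ["book"],                         # assumed: bib has only book children
    "book": ["chapter", "author", "title"],  # from the text: chapter->0, author->1, title->2
    "chapter": ["section", "title"],         # assumed ordering of chapter's children
}


def derive_path(root_tag, label):
    """Derive the root-to-node tag path of an element from its label alone.

    label is a sequence of integers, one per step below the root
    (the root itself is assumed to carry the empty label).
    """
    path = [root_tag]
    current = root_tag
    for x in label:
        children = CHILD_TAGS[current]           # transitions out of the current state
        current = children[x % len(children)]    # the modulo picks the child tag
        path.append(current)
    return "/".join(path)


section_path = derive_path("bib", [0, 3, 2])  # book, chapter (3 mod 3 = 0), section (2 mod 2 = 0)
title_path = derive_path("bib", [0, 3, 1])    # book, chapter, title (1 mod 2 = 1)
author_path = derive_path("bib", [0, 4])      # book, author (4 mod 3 = 1)
```

Because the whole path is recoverable from one label, no labels of ancestor elements need to be read to learn an element's ancestry.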
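Once the path of a leaf element is known from its label, a path query such as //chapter/section/text can be checked against that path without reading any other element. The sketch below illustrates this idea only. It is not the TJFast algorithm itself, which also handles branching twig queries. The function names, the query representation as (axis, tag) pairs, and the sample path are hypothetical.

```python
def path_matches(path, steps):
    """Check a root-to-leaf tag path against a path query.

    path  : string such as 'bib/book/chapter/section/text'
    steps : list of (axis, tag) pairs, axis being '/' or '//';
            tag '*' matches any element. The last step must match the leaf.
    """
    tags = path.split("/")

    def match(ti, si):
        # tags[ti] must satisfy query step si.
        axis, name = steps[si]
        if name != "*" and tags[ti] != name:
            return False
        if si == 0:
            # The first step is anchored at the root unless its axis is '//'.
            return axis == "//" or ti == 0
        if axis == "/":
            # Parent-child edge: the previous step must match the direct parent.
            return ti >= 1 and match(ti - 1, si - 1)
        # Ancestor-descendant edge: the previous step may match any ancestor.
        return any(match(k, si - 1) for k in range(ti))

    return match(len(tags) - 1, len(steps) - 1)


leaf_path = "bib/book/chapter/section/text"  # hypothetical derived path of a text element
q1 = [("//", "chapter"), ("/", "section"), ("/", "text")]  # //chapter/section/text
q2 = [("//", "book"), ("/", "section"), ("/", "text")]     # //book/section/text
q3 = [("//", "book"), ("//", "text")]                      # //book//text

r1 = path_matches(leaf_path, q1)
r2 = path_matches(leaf_path, q2)
r3 = path_matches(leaf_path, q3)
```

Here only the label stream of the query's leaf tag would be scanned, which is the source of the I/O savings the results analysis attributes to TJFast.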