A New Algorithm for Evaluating Ordered Tree Pattern Queries Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg,

A New Algorithm for Evaluating Ordered Tree Pattern Queries Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9

Outline Motivation Algorithm for tree pattern query evaluation based on ordered tree matching -Tree encoding -Algorithm description Experiment results Summary

Efficient method to evaluate XPath expression queries – XML query processing XML documentsa tree pattern query Motivation

Document: dell IBM part#1 Intel Part#2 Houston Winnipeg Y-Chen P SB IILLNN I M MNN IBMPart#1Part#2 Dell HoustonWinnipegY-Chen Intel Motivation

Document:Query – XPath expressions: Q1: /Purchase[Seller[Loc=‘Boston’]]/ Buyer[Loc = ‘New York’ Q2: /Purchase//Item[Manufacturer = ‘Intel’] Purchase Seller Buyer Location ‘Houston’ ‘Winnipeg’ Buyer Item Manufacturer ‘Intel’ d-edge: ancestor- descendant relationship c-edge: parent-child relationship P SB IILLNN I M MNN IBMPart#1Part#2 Dell HoustonWinnipegY-Chen Intel

Motivation - XPath expression a[b[c and.//d]]/b[c and e//d] book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and ln = ‘Knuth’] a bb cdce d title Art of Programming book author fnln Knuth Donald Art of Programming Donald Knuth … XPath evaluation against XML documents

- Evaluation based on unordered tree matching: Definition An embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: (i)Preserve node label: For each u  Q, label(u) matches label(f(u)). (ii)Preserve parent-child/ancestor-descendant relationships: If u  v in Q, then f(v) is a child of f(u) in T; if u  v in Q, then f(v) is a descendant of f(u) in T. Motivation XPath evaluation against XML documents a b c Q:Q: q3q3 q1q1 q2q2 a b c cbb T:T: v1v1 v2v2 v3v3 v4v4 v5v5 v6v6

Motivation XPath evaluation against XML documents - Evaluation based on ordered tree matching XPath expression: s/vp[v = “reads” and /following-sibling::n = “book” and /following sibling::adv] s vp vnadv “reads”“book” s np vp detnvnpadv “The”“student”“reads”detadjn “the”“interesting”“book” “carefully”

Motivation XPath evaluation against XML documents - Evaluation based on ordered tree matching: Definition An embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: (i)Preserve node label: For each u  Q, label(u) matches label(f(u)). (ii)Preserve parent-child/ancestor-descendant relationships: If u  v in Q, then f(v) is a child of f(u) in T; if u  v in Q, then f(v) is a descendant of f(u) in T. (iii)Preserve sibling order: For any two nodes v 1  Q and v 2  Q, if v 1 is to the left of v 2, then f(v 1 ) is to the left of f(v 2 ) in T. a b c Q:Q: q3q3 q1q1 q2q2 a b c cbb T:T: v1v1 v2v2 v3v3 v4v4 v5v5 v6v6

Algorithm for query evaluation Tree encoding Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document. (1, 1, 11, 1) (1, 10, 10, 2) B v 8 A v 1 (1, 7, 7, 4) (1, 6, 6, 4) T:T: (1, 5, 5, 4) (1, 3, 3, 3) B v 2 v 3 C B v 4 D v 7 v 5 C (1, 4, 8, 3) (1, 2, 9, 2) v 6 C 1 2 3 4 3 5 6 5 6 7 7 8 9 10 11

Tree encoding Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted as  (v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document. (i)ancestor-descendant: a node v 1 associated with (d 1, l 1, r 1, ln 1 ) is an ancestor of another node v 2 with (d 2, l 2, r 2, ln 2 ) iff d 1 = d 2, l 1 r 2. (ii)parent-child: a node v 1 associated with (d 1, l 1, r 1, ln 1 ) is the parent of another node v 2 with (d 2, l 2, r 2, ln 2 ) iff d 1 = d 2, l 1 r 2, and ln 2 = ln 1 + 1. (iii)from left to right: a node v 1 associated with (d 1, l 1, r 1, ln 1 ) is to the left of another node v 2 with (d 2, l 2, r 2, ln 2 ) iff d 1 = d 2, r 1 < l 2.

A: (1, 1, 11, 1) B: (1, 2, 9, 2) (1, 4, 8, 3), (1, 10, 10, 2) C: (1, 3, 3, 3) (1, 5, 5, 4), (1, 6, 6, 4) D: (1, 7, 7, 4) Data streams: Algorithm for query evaluation Tree encoding (1, 1, 11, 1) (1, 10, 10, 2) B v 8 A v 1 (1, 7, 7, 4) (1, 6, 6, 4) T:T: (1, 5, 5, 4) (1, 3, 3, 3) B v 2 v 3 C B v 4 D v 7 v 5 C (1, 4, 8, 3) (1, 2, 9, 2) v 6 C sorted by LeftPos values 1 2 3 4 3 5 6 5 6 7 7 8 9 10 11

Algorithm for query evaluation Algorithm description Our algorithm works bottom-up. Therefore, we need to sort XML streams by (DocID, RightPos) values. Each time a query Q is submitted to the system, we will associate each query node q with a data stream L(q) such that for each v  L(q) label(v) = label(q), in which each query node is attached with a list of matching nodes of the document tree. {v1}{v1} A q 1 q 2 B B q 5 q 3 CC q 4 L(q1)L(q1) Q:Q: L(q2)L(q2)L(q5)L(q5) L(q4)L(q4) L(q3)L(q3) L(q 1 ) = (1, 1, 11, 1) - L(q 2 ) = L(q 5 ) = (1, 4, 8, 3), (1, 2, 9, 2) (1, 10, 10, 2) - L(q 3 ) = L(q 4 ) = (1, 3, 3, 3) (1, 5, 5, 4), (1, 6, 6, 4) - { v 4, v 2, v 8 } { v 3, v 5, v 6 } sorted by RightPos values T:T: Q:Q:

Algorithm for query evaluation Algorithm description 1.First, we will number the nodes of Q in postorder. So the nodes in Q will be referenced by their postorder numbers. Additionally, we set a virtual node for Q, numbered 0, which is considered to be to the left of any node in Q. A q 1 B q 5 C q 4 q 2 B q 3 C (q1)(q1) (q2)(q2) 2.For each node q of Q, a link from it to the left-most leaf node in Q[q], denoted by (q), is established. For a leaf node q’, (q’) = q’. 5 3 A q 1 B q 5 C q 4 q 2 B q 3 C Q : 1 2 4 0 virtual node

Algorithm for query evaluation Algorithm description 3.Let q’ be a leaf node in Q. We denote by  -1 (q’) a set of nodes x such that for each q  x (q) = q’. A q 1 B q 5 C q 4 q 2 B q 3 C (q1)(q1) (q2)(q2)  -1 (q 3 ) = {q 1, q 2, q 3 }  -1 (q 4 ) = {q 4 }  -1 (q 5 ) = {q 5 }

Algorithm for query evaluation Algorithm description 4.Each node v in T is associate it with an array A v of length |Q|, indexed from 0 to |Q| - 1. In A v, each entry is a query node or , defined below: q Av:Av: Av[q]Av[q] x1x1 xjxj A v [q] = Max{x | x   -1 (q’)  T[v] embeds Q[x]}, If there is a least leaf q’ larger than q such that  -1 (q’) contains at least one node x with Q[x] being embedded in T[v]; Otherwise.  q’ Q:Q:

Algorithm for query evaluation Algorithm description q Av:Av: Av[q]Av[q] x1x1 xjxj q’ Q:Q: X 1 is the largest ancestor of q’ such that T[v] contains Q[X 1 ]. q’ is the closest leaf node to the right of q. In this way, both the subtree embedding and the left-to-right ordering can be recorded.

Algorithm for query evaluation Setting values in A v (i)If we find Q[x] can be embedded in T[v], we will set A v [q 1 ],..., A v [q k ] to x, where each q l (1  l  k) is a query node to the left of x, to record the fact that x is the closest node to the right of q l such that T[v] embeds Q[x]. 5 3 A q 1 B q 5 C q 4 q 2 B q 3 C Q : 1 2 4 0 B v 8 A v 1 T:T: B v 2 v 3 C B v 4 D v 7 v 5 Cv 6 C Av3:Av3: 0123401234   Av3:Av3: 0123401234 1 

Algorithm for query evaluation Setting values in A v (ii)If some time later we find another node x’, which is to the right of x, such that Q[x’] can be embedded in T[v], we will set A v [p 1 ],..., A v [p s ] to x’, where each p l (1  l  s) is to the left of x’ but to the right of q k. 5 3 A q 1 B q 5 C q 4 q 2 B q 3 C Q : 1 2 4 0 B v 8 A v 1 T:T: B v 2 v 3 C B v 4 D v 7 v 5 Cv 6 C Av3:Av3: 0123401234 1  Av3:Av3: 0123401234 1  2 

Algorithm for query evaluation Setting values in A v (iii)If x’ is an ancestor of x, we will find all those entries pointing to a descendant of x’ on the left-most path in Q[x’]. Replace these entries with x’. (iv)For all the other nodes v’ such that T[v’] embeds Q[x], we will set values for the entries in A v’ in the same way as (i), (ii), and (iii). 5 3 A q 1 B q 5 C q 4 q 2 B q 3 B Q : 1 2 4 0 B v 8 A v 1 T:T: B v 2 v 3 C B v 4 D v 7 v 5 Bv 6 C Av4:Av4: 0123401234 1  Av4:Av4: 0123401234 3 

Algorithm for query evaluation Using A v to check tree embedding 5.Let q in Q and v in T be the nodes encountered. Let v 1,..., v k be the child nodes of v. Let q 1,..., q l be the child nodes of q. We first check A v1 starting from A v1 [h], where h = (q) - 1. We begin the search from (q) - 1 because it is the closest node to the left of the first child of q. Let A v1 [h] = q’. If q’ is not q 1, nor an ancestor of q 1, we will check A v2 [h] in a next step. This process continues until one of the following conditions is satisfied: (i)All A vj ’s have been checked, or (ii)There exists v j such that A vj [h] is q 1 or an ancestor of q 1. q Av1:Av1: h = (q) - 1 q1q1 ? qv label(v) = label(q) q1q1 qlql v1v1 vkvk

Algorithm for query evaluation Using A v to check tree embedding 6.If all A vj ’s are checked (case (i)), it shows that Q[q 1 ] cannot be embedded in any subtree rooted at a child node of v. So T[v] cannot embed Q[v]. If it is case (ii), we know that T[v j ] embeds Q[q 1 ]. If q 1 is a //-child, or both q 1 and v j are /-children, we will continue to check A v(j+1) [g] against q 2, where g = A vj [h]. (Otherwise, we will continue to check A v(j+1) [h] against q 1.) q Av1:Av1: h = (q) - 1 q1q1 Av2:Av2: q2q2 A v2 :[q 1 ]

Algorithm for query evaluation Algorithm description A v3v3 : A v5v5 : A v6v6 : A v4v4 : A v2v2 : A v8v8 : A v1v1 : [2, 2, , ,  ] [1, , , ,  ] [2, 2, , ,  ] [1, , , ,  ] [1, 4, 4, 4,  ] [5, , , ,  ] [1, 4, 4, 4,  ] [3, , , ,  ] [3, 2, , ,  ] [5, 2, 4, 4,  ] [3, 4, 4, 4,  ] [3, 2, 4, 4,  ]

Algorithm for query evaluation Experiments We conducted our experiments on a DELL desktop PC equipped with Pentium(R) 4 CPU 2.80GHz, 0.99GB RAM and 20GB hard disk. The code was compiled using Microsoft Visual C++ compiler version 6.0, running standalone. Tested methods In the experiments, we have tested four methods: -TwigStack (TS for short) [3], -Twig 2 Stack (T 2 S for short) [10], -PRIX [30], -tree-embedding (discussed in this paper, TE for short).

Tested methods In the experiments, we have tested four methods: -TwigStack (TS for short) [1], -Twig 2 Stack (T 2 S for short) [2], -PRIX [3], -tree-embedding (discussed in this paper, TE for short). Algorithm for query evaluation Experiments [1]N. Bruno, N. Koudas, and D. Srivastava, Holistic Twig Joins: Optimal XML Pattern Matching, in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321. [2]S. Chen, H-G. Li, J. Tatemura, W-P. Hsiung, D. Agrawa, and K.S. Canda, Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents, in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294. [3]P. Rao and B. Moon, Sequencing XML Data and Query Twigs for Fast Pattern Matching, ACM Transaction on Database Systems, Vol. 31, No. 1, March 2006, pp. 299-345.

Algorithm for query evaluation Experiments Theoretical computational complexities methodsQuery timeRuntime space usage TwigStackO(|D| |Q| ) Twig 2 StackO(|D||Q| 2 +|subTwig- Results| O(|D||Q|) PRIXO(|T| |Q| )O(|T| + |Q|) TEO(|T|leaf Q ) O(leaf T leaf Q ) Indexes XB-trees used for TwigStack, Twig 2 Stack, TE. Trie structure used for PRIX

Algorithm for query evaluation Experiments Data sets The TreeBank dataset is a real data set with narrow and deeply recursive structure that includes multiple recursive elements. (U. of Washington, The Tukwila System, available from http://data.cs.washington.edu/integration/tukwila/.) Data size: 82 MB Num. of nodes: 2.43 million Max/average tree depth: 36/7.9 Queries Q1: //VP[DT]//PRP_DOLLAR Q2: //S/VP/PP[IN]/NP Q3: //S/VP//PP[NP/VB]/IN Q4: //VP[.//PP/IN]//NP/*//JJ Q5: //S[CC][.//PP]//NP[VBZ][IN]//JJ

Algorithm for query evaluation Experiments Test results Page numbers For all the experiments, the buffer pool size was fixed at 2000 pages. The page size of 8KB was used. For each data set, all the tag names are stored in a single list and then each tag name is represented by its order number in that list during the evaluation of queries. In our implementation, each DocId occupies 4 bytes while a number in a Prüfer sequence, a LeftPos or a RightPos occupies 2 bytes. A levelNum value takes only 1 byte. I

Algorithm for query evaluation Experiments Q1Q2Q3Q4Q5 24 18 12 6 1 execution time (sec.) PRIX TE TS T 2 S + + + + + Q1Q2Q3Q4Q5 16 12 8 4 1 I/O time (sec.) PRIX TE TS T 2 S + + + + +

Summary An efficient method for evaluating ordered tree pattern queries in XML document databases - parent/child and ancestor/descendant relations -from-left-to-right relations Computational complexity - O(|T|leaf Q ) time -O(leaf T leaf Q ) space Experiments -TreeBank database -I/O time and CPU processing time

Thank you.

A New Algorithm for Evaluating Ordered Tree Pattern Queries Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg,

Similar presentations

Presentation on theme: "A New Algorithm for Evaluating Ordered Tree Pattern Queries Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A New Algorithm for Evaluating Ordered Tree Pattern Queries Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg,

Similar presentations

Presentation on theme: "A New Algorithm for Evaluating Ordered Tree Pattern Queries Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg,"— Presentation transcript:

Similar presentations

About project

Feedback