Efficient Processing of Partially Specified Twig Queries
Junfeng Zhou, Renmin University of China


Outline
– Introduction
– Preliminary
– PTwigStack
– Conclusion


Introduction (1)
– XML has been used extensively as a standard for information representation and exchange.
– More and more data is stored and exchanged in XML format.
– Effective and efficient querying of XML data is indispensable.

Introduction (2)
Using a standard query language (XPath or XQuery): how can we write a proper query when
– the structure or schema is not fully available, or
– information must be extracted from different data sources with different structures?
[Figure: example XML document with nodes bibliography(1), bib(2), bib(…), year(3), book(4), title(5), author(6), article(7), title(8), author(9), author(10) and text values 1999, XML, Joe, Mary, Bob; twig query Q over book, title and author.]

Introduction (4)
Using a keyword-based query. For example [1]:
– Find the title and author of the publications.
– The answer is (5,6), (8,9,10).
[Figure: the example XML document with nodes bibliography(1) through author(10).]
[1] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In Proceedings of VLDB 2004, pages 72-83, 2004.

Introduction (5)
Using a keyword-based query: what if nodes 6 and 8 are removed from the document?
– Find the title and author of the publications.
– The answer is (5,9,10), which is a meaningless result.
– The correct answer is (5,NULL), (NULL,9,10).
[Figure: the example XML document without nodes 6 and 8: bibliography(1), bib(2), bib(…), year(3), book(4), title(5), article(7), author(9), author(10).]

Introduction (6)
Using a Partially Specified Twig Query (PSTQ) [2]
– It gives users the most flexibility.
– But no existing method can process a PSTQ efficiently.
[2] Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence of Index Graphs, CIKM, 2006.

Introduction (7)
Objective
– A concise but effective way to specify more flexible semantic constraints in a twig query.
– An efficient approach that processes a PSTQ holistically, without deriving twig queries and processing them one by one:
  – Scan once: each stream whose elements' tag appears in the twig pattern is scanned only once.
  – No redundant output: none of the intermediate path solutions is useless.
  – Bounded space complexity: the space required by the algorithm is bounded by a factor that is independent of the source document size.

Outline
– Introduction
– Preliminary
  – Holistic Twig Join
  – Partially Specified Twig Query
– PTwigStack
– Conclusion

Preliminary: Holistic Twig Join [3]
Query processing
– Output useful path solutions.
– Merge all path solutions to get the final results.
Data structure
– Each query node is associated with a stack and an element stream.
Benefits
– No useless path solutions.
[Figure: XML document with root R and elements a1, a2, b1, b2, c1; twig query Q over A, B and C.]
[3] N. Bruno, N. Koudas, and D. Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. Technical Report, Columbia University, March 2002.

Preliminary: Partially Specified Twig Query [2]
[Figure: example query Q1.]
– Q1 consists of two partial paths (PPs), p1 and p2.
– In p1, Y is a descendant of W.
– In p2, W and A are on the same path.
– p1 shares W with p2.
– “*” means p2 is the output path.
Compared with a twig query:
– Some nodes are specified as being on the same path as other nodes, without stating which one precedes the other.
Compared with a keyword-based query:
– Each part of the query can be a path expression, not just a keyword.
Benefits of using a PSTQ:
– Users can specify a query with whatever partial knowledge they have.

Preliminary: Partially Specified Twig Query
Query processing of a PSTQ: a naïve method
– Derive the twig queries.
– Process each twig query.
Problems of the naïve method
– The processing cost is too high.
– Redundant results must be eliminated.
[Figure: PSTQ Q over A, B and C with the derived twig queries Q1 to Q4; XML document with elements a1, b1, c1.]
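The cost of the naïve method can be illustrated with a small sketch. Everything below is an assumption made for illustration: the region-encoded document (a1 contains b1, which contains c1), the shapes of Q1 to Q4 (read off the flattened figure as A//B//C, A//C//B, B//A//C and A[//B]//C), and the helper names `contains` and `evaluate`. Each derived query is evaluated separately by brute force, and the same match is then reported more than once.

```python
from itertools import product

# Hypothetical region-encoded document: (name, start, end).
# a1 contains b1, and b1 contains c1.
DOC = {"A": [("a1", 1, 6)], "B": [("b1", 2, 5)], "C": [("c1", 3, 4)]}

# Assumed shapes of the derived twig queries, as (ancestor, descendant) edges.
DERIVED = {
    "Q1": [("A", "B"), ("B", "C")],  # A//B//C
    "Q2": [("A", "C"), ("C", "B")],  # A//C//B
    "Q3": [("B", "A"), ("A", "C")],  # B//A//C
    "Q4": [("A", "B"), ("A", "C")],  # A[//B]//C
}

def contains(anc, desc):
    """True if the region of anc strictly encloses the region of desc."""
    return anc[1] < desc[1] and desc[2] < anc[2]

def evaluate(edges):
    """Brute-force evaluation of one twig query over DOC."""
    tags = sorted(DOC)
    matches = []
    for combo in product(*(DOC[t] for t in tags)):
        binding = dict(zip(tags, combo))
        if all(contains(binding[p], binding[c]) for p, c in edges):
            matches.append(tuple(binding[t][0] for t in tags))
    return matches

# Naive method: run every derived query, then eliminate duplicates.
raw = [m for edges in DERIVED.values() for m in evaluate(edges)]
unique = sorted(set(raw))
print(len(raw), unique)
```

On this document both Q1 and Q4 match (a1, b1, c1), so the document is traversed four times and a duplicate still has to be removed afterwards. That is the overhead a holistic method is meant to avoid.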


PTwigStack: PSTQ Expression
Extending XPath by adding an operator
– The operator denotes the "being on the same path" relationship.
– A and B joined by this operator is equivalent to A//B or B//A.
[Figure: a PSTQ Q over A, B and C and the twig queries Q1 to Q7.]
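The same-path relationship can be checked directly on two elements once each element carries a region code. The sketch below assumes a (start, end) region encoding, which the slides use later in conditions such as C q.end < C q’.start. The names `Element`, `is_ancestor` and `same_path` are hypothetical.

```python
from typing import NamedTuple

class Element(NamedTuple):
    name: str
    start: int  # position of the opening tag
    end: int    # position of the closing tag

def is_ancestor(a: Element, d: Element) -> bool:
    """A//D on elements: the region of a strictly encloses the region of d."""
    return a.start < d.start and d.end < a.end

def same_path(x: Element, y: Element) -> bool:
    """Same-path relationship: x//y or y//x."""
    return is_ancestor(x, y) or is_ancestor(y, x)

a1 = Element("a1", 1, 10)
b1 = Element("b1", 2, 5)   # inside a1
c1 = Element("c1", 6, 7)   # inside a1, sibling of b1

print(same_path(a1, b1), same_path(b1, a1), same_path(b1, c1))
```

The relation is symmetric by construction, which is the point of the new operator: the user does not need to know which of the two tags is the ancestor.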

PTwigStack
Objective
– Scan once
– No redundant output
– Bounded space complexity
Problems
– Which query node should be processed first? According to a special order in the given query.
– Which element should be processed first? The element with a solution extension.
– How to guarantee that no useless path solutions are produced? An element that cannot participate in answers is not pushed onto a stack.
[Figure: document with elements a1, a2, b1, b2, b3, c1; PSTQ Q and derived twig queries Q1 to Q4 over A, B and C.]

PTwigStack: Problems (1)
– Which query node should be processed first?
– Depth-first order: A, B, C.
[Figure: document with elements a1, a2, b1, b2, b3, c1; PSTQ Q and derived twig queries Q1 to Q4 over A, B and C.]

PTwigStack: Problems (2)
– Which element should be processed first?
– The element with a Partial Solution Extension (PSE).
Partial Solution Extension
We say a query node q has a PSE iff q satisfies one of the following conditions, where C_q is the current (head) element of q's stream:
– If q is a leaf node, C_q is not NULL.
– If q is a non-leaf node, then for each q' ∈ children(q):
  – If q//q', then C_q is an ancestor of C_q'.
  – If q and q' are in the same-path relationship and q' has a PSE, then C_q covers C_q', or is covered by C_q', or C_q.end < C_q'.start.
  – If q and q' are in the same-path relationship and q' has no PSE, let p be a descendant of q' that has a PSE; then C_q.start < C_p.start.
[Figure: document with elements a1, a2, b1, b2, b3, c1; PSTQ Q and derived twig queries Q1 to Q4; small example documents over a1, b1, c1 and c0 illustrating the conditions.]
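The PSE conditions can be read as a recursive test over the query tree. The following is a minimal sketch of a literal reading of the slide, not the authors' implementation. It assumes a (start, end) region encoding, writes the same-path relationship as "~" (a notation chosen here), treats a non-leaf node whose stream is exhausted as having no PSE (the slide does not say), and picks the first descendant with a PSE as p. `Element`, `QNode`, `covers`, `descendants` and `has_pse` are hypothetical names.

```python
from typing import Dict, Iterator, List, NamedTuple, Optional, Tuple

class Element(NamedTuple):
    name: str
    start: int
    end: int

class QNode:
    """Query node; children are (relation, node) pairs, relation is '//' or '~'."""
    def __init__(self, tag: str,
                 children: Optional[List[Tuple[str, "QNode"]]] = None):
        self.tag = tag
        self.children = children or []

Heads = Dict[str, Optional[Element]]  # tag -> head element C_q, or None

def covers(x: Element, y: Element) -> bool:
    return x.start < y.start and y.end < x.end

def descendants(q: QNode) -> Iterator[QNode]:
    for _, child in q.children:
        yield child
        yield from descendants(child)

def has_pse(q: QNode, heads: Heads) -> bool:
    cq = heads.get(q.tag)
    if cq is None:            # leaf rule; also assumed for non-leaf nodes
        return False
    for rel, child in q.children:
        cc = heads.get(child.tag)
        if rel == "//":
            # q//q': C_q must be an ancestor of C_q'
            if cc is None or not covers(cq, cc):
                return False
        elif has_pse(child, heads):
            # same path, q' has a PSE: cover, be covered, or end before it
            if not (covers(cq, cc) or covers(cc, cq) or cq.end < cc.start):
                return False
        else:
            # same path, q' has no PSE: compare with a descendant p that has one
            p = next((d for d in descendants(child) if has_pse(d, heads)), None)
            if p is None or not cq.start < heads[p.tag].start:
                return False
    return True

# Query: A with a same-path child B and a descendant child C.
query = QNode("A", [("~", QNode("B")), ("//", QNode("C"))])
heads = {"A": Element("a1", 2, 9), "B": Element("b1", 1, 10),
         "C": Element("c1", 3, 4)}
print(has_pse(query, heads))
```

Here b1 encloses a1, so the same-path condition holds even though B is written as a child of A in the query, and a1 encloses c1, so A has a PSE.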

PTwigStack
Feature of the Partial Solution Extension
– If E has a PSE, E must have a solution extension in some twig query derived from the given PSTQ, which means C_E may participate in the final results.
Usage of the Partial Solution Extension
– Guiding the execution of PTwigStack.

PTwigStack: Problems (3)
– How to guarantee that no useless path solutions are produced? Prevent useless elements from being pushed onto a stack.
– What is a useless element? One that cannot satisfy the query requirements together with the top elements of the correlated stacks or the head element of each element stream.
[Figure: three example documents over elements a0, a1, b1, c1 for the query over A, B and C.]

PTwigStack: Data Structure
– Stack: each query node is associated with a stack that compactly represents intermediate results.
– Tag index: each query node is associated with an element stream.
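The slides do not spell out the stack operations. In the original TwigStack algorithm [3], the elements on one stack are nested inside each other, and cleaning a stack pops every entry that ends before the next element starts. The sketch below assumes the same behaviour and a (start, end) region encoding. `Element` and `clean_stack` are hypothetical names, and PTwigStack's own cleaning step, which also outputs path solutions, may differ.

```python
from typing import List, NamedTuple

class Element(NamedTuple):
    name: str
    start: int
    end: int

def clean_stack(stack: List[Element], next_start: int) -> List[Element]:
    """Pop entries that cannot be ancestors of an element starting at next_start."""
    popped = []
    while stack and stack[-1].end < next_start:
        popped.append(stack.pop())
    return popped

# Stack for query node A: a2 is nested inside a1.
stack_a = [Element("a1", 1, 10), Element("a2", 2, 5)]

# The next element in some stream starts at position 6: a2 has already
# closed, a1 is still open and stays on the stack.
removed = clean_stack(stack_a, 6)
print([e.name for e in stack_a], [e.name for e in removed])
```

Because only nested elements remain on a stack, its depth never exceeds the length of a root-to-leaf path in the document.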

PTwigStack: Algorithm
PTwigStack(root)
// the first stage
while not end(root)
    q = getNext(root)
    clean all stacks related to q and output the relevant path solutions
    if C_q can be pushed onto stack S_q
        Push(S_q, C_q)
    process the other elements C_q' iteratively, where q' is a child of q in the query and C_q'.start < C_q.start
    output all possible path solutions
    Advance(C_q)
// the second stage
MergeAllPathSolution()

PTwigStack: Running Example
[Figure: document with elements a1, a2, a3, b1, b2, c1, c2; PSTQ Q and derived twig queries Q1 to Q4 over A, B and C. The slides step through the PTwigStack(root) algorithm; the text that survives from each step is listed below.]
– Step 1: Output and Final Result are empty.
– Step 2: c1.
– Step 3: a1, b1.
– Step 4: a1, b1.
– Step 5: a1, b1, c2. Output: a1c2.
– Step 6: a1, b1, b2. Output: a1c2, a1b2.
– Step 7: a1, b1. Output: a1c2, a1b2, a1b1. Final Result: a1b1c2, a1b2c2.
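The second stage, MergeAllPathSolution, is not detailed on the slides. A minimal sketch, assuming the path solutions of the running example are root-to-leaf pairs for the branches A-B and A-C and that merging is a join on the shared A element, could look as follows. The function name `merge_path_solutions` and the pair representation are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

def merge_path_solutions(ab_paths: List[Tuple[str, str]],
                         ac_paths: List[Tuple[str, str]]
                         ) -> List[Tuple[str, str, str]]:
    """Join (A, B) and (A, C) path solutions on their common A element."""
    c_by_root = defaultdict(list)
    for a, c in ac_paths:
        c_by_root[a].append(c)
    return [(a, b, c) for a, b in ab_paths for c in c_by_root.get(a, [])]

# Path solutions taken from the last step of the running example.
ab_paths = [("a1", "b1"), ("a1", "b2")]
ac_paths = [("a1", "c2")]

final = merge_path_solutions(ab_paths, ac_paths)
print(final)
```

Joining the three path solutions this way yields (a1, b1, c2) and (a1, b2, c2), which matches the Final Result shown on the last slide of the example.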

PTwigStack: Properties
– Each element is scanned only once.
– Each element in a stack must participate in at least one final result.
– No "eliminating operation" is needed for redundant results.
– Space is bounded by |Q|×L, where L is the length of the longest path in the XML source document and |Q| is the number of nodes in the given query Q.


Conclusion
– We propose a concise but effective way to express the semantics of being on the same path by extending XPath.
– We propose a new concept, Partial Solution Extension, to guide the execution of getNext.
– We propose a new holistic join method to process a PSTQ with a root node.

Future Work
The above method cannot be applied directly to a query that is not specified with a root node, e.g.
– #[//A]//B
– #[//A//B]//C
– #[//A B]//C
Possible solutions
– Implementing a special algorithm to process a PSTQ that is not specified with a root node (using Dewey codes).
– Using ORA-SS [4] to construct a twig query with more semantic constraints (using range codes).
[4] Gillian Dobbie, Wu Xiaoying, Tok Wang Ling, Mong Li Lee. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. TR21/00, Technical Report, Department of Computer Science, National University of Singapore, December 2000.

Thank You!
Q & A

