Holistic Twig Joins Optimal XML Pattern Matching Nicolas Bruno Columbia University Nick Koudas Divesh Srivastava AT&T Labs-Research SIGMOD 2002.

Slides:



Advertisements
Similar presentations
Ting Chen, Jiaheng Lu, Tok Wang Ling
Advertisements

Jiaheng Lu, Ting Chen and Tok Wang Ling National University of Singapore Finding all the occurrences of a twig.
1 Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach Jiaheng Lu, Ting Chen, Tok Wang Ling National University of.
From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen National.
On Boosting Holism in XML Twig Pattern Matching Using Two Data Streaming Techniques Presenter: Lu Jiaheng Supervisor: Prof. Ling Tok Wang Joint work: Chen.
Symmetrically Exploiting XML Shuohao Zhang and Curtis Dyreson School of E.E. and Computer Science Washington State University Pullman, Washington, USA.
1 Virtual Cursors for XML Joins Beverly Yang (Stanford) Marcus Fontoura, Eugene Shekita Sridhar Rajagopalan, Kevin Beyer CIKM’2004.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Jianxin Li, Chengfei Liu, Rui Zhou Swinburne University of Technology, Australia Wei Wang University of New South Wales, Australia Top-k Keyword Search.
Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,
Efficient Keyword Search for Smallest LCAs in XML Database Yu Xu Department of Computer Science & Engineering University of California, San Diego Yannis.
Structural Joins: A Primitive for Efficient XML Query Pattern Matching Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava,
DIMACS Streaming Data Working Group II On the Optimality of the Holistic Twig Join Algorithm Speaker: Byron Choi (Upenn) Joint Work with Susan Davidson.
Structural Joins: A Primitive for Efficient XML Query Pattern Matching Al Khalifa et al., ICDE 2002.
CSE 6331 © Leonidas Fegaras XML and Relational Databases 1 XML and Relational Databases Leonidas Fegaras.
1 Abdeslame ALILAOUAR, Florence SEDES Fuzzy Querying of XML Documents The minimum spanning tree IRIT - CNRS IRIT : IRIT : Research Institute for Computer.
TIMBER A Native XML Database Xiali He The Overview of the TIMBER System in University of Michigan.
1 CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Ming Li.
Web Data Management XML Query Evaluation 1. Motivation PTIME algorithms for evaluating XPath queries: – Simple tree navigation – Translation into logic.
1 Indexing and Querying XML Data for Regular Path Expressions A Paper by Quanzhong Li and Bongki Moon Presented by Amnon Shochot.
1 Database indices Database Systems manage very large amounts of data. –Examples: student database for NWU Social Security database To facilitate queries,
XSEarch: A Semantic Search Engine for XML Sara Cohen Jonathan Mamou Yaron Kanza Yehoshua Sagiv Presented at VLDB 2003, Germany.
Distributed Constraint Optimization * some slides courtesy of P. Modi
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Authors: Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan Presented By: Aruna Keyword Search on External Memory Data Graphs.
1 Prefix Path Streaming: a New Clustering Method for XML Twig Pattern Matching Ting Chen, Tok Wang Ling, Chee-Yong Chan School of Computing, National University.
A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML Represented by: Ai Mu Based on the paper written by Ning Zhang, Varun.
1 Holistic Twig Joins: Optimal XML Pattern Matching ACM SIGMOD 2002.
G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs Sherif SakrSameh ElniketyYuxiong He NICTA & UNSW Sydney, Australia Microsoft Research Redmond,
1 Ranking Inexact Answers. 2 Ranking Issues When inexact querying is allowed, there may be MANY answers –different answers have a different level of incompleteness.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Chapter 13 Query Processing Melissa Jamili CS 157B November 11, 2004.
Querying Structured Text in an XML Database By Xuemei Luo.
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML1 Efficient Structural Joins on Indexed XML Documents Shu-Yao Chien, Zografoula Vagena, Donghui.
TwigStackList¬: A Holistic Twig Join Algorithm for Twig Query with Not-predicates on XML Data by Tian Yu, Tok Wang Ling, Jiaheng Lu, Presented by: Tian.
5/2/20051 XML Data Management Yaw-Huei Chen Department of Computer Science and Information Engineering National Chiayi University.
Query Execution Section 15.1 Shweta Athalye CS257: Database Systems ID: 118 Section 1.
Tree-Pattern Queries on a Lightweight XML Processor MIRELLA M. MORO Zografoula Vagena Vassilis J. Tsotras Research partially supported by CAPES, NSF grant.
Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.
CS4432: Database Systems II Query Processing- Part 2.
Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Attila Barta Mariano P. Consens Alberto O. Mendelzon University of Toronto.
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen National.
Tree-Pattern Queries on a Lightweight XML Processor MIRELLA M. MORO Zografoula Vagena Vassilis J. Tsotras Research partially supported by CAPES, NSF grant.
Bootstrapped Optimistic Algorithm for Tree Construction
Holistic Twig Joins: Optimal XML Pattern Matching Written by: Nicolas Bruno Nick Koudas Divesh Srivastava Presented by: Jose Luna John Bassett.
1 Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 2002 Presented by Jun-Ki Min.
Holistic Twig Joins: Optimal XML Pattern Matching Nicholas Bruno, Nick Koudas, Divesh Srivastava ACM SIGMOD 02 Presented by: Li Wei, Dragomir Yankov.
Chapter 4, Part II Sorting Algorithms. 2 Heap Details A heap is a tree structure where for each subtree the value stored at the root is larger than all.
1 Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach Presenter: Qi He.
XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database.
1 Efficient Processing of Partially Specified Twig Queries Junfeng Zhou Renmin University of China.
Querying Structured Text in an XML Database Shurug Al-Khalifa Cong Yu H. V. Jagadish (University of Michigan) Presented by Vedat Güray AFŞAR & Esra KIRBAŞ.
1 Keyword Search over XML. 2 Inexact Querying Until now, our queries have been complex patterns, represented by trees or graphs Such query languages are.
1 Ranking Inexact Answers. 2 Ranking Issues When inexact querying is allowed, there may be MANY answers –different answers have a different level of incompleteness.
Structural Joins: A Primitive for Efficient XML Query Pattern Matching Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava,
Mehdi Kargar Department of Computer Science and Engineering
Efficient Evaluation of XQuery over Streaming Data
Efficient processing of path query with not-predicates on XML data
Database Management System
Presented by Sandhya Rani Are Prabhas Kumar Samanta
Holistic Twig Joins: Optimal XML Pattern Matching
Probabilistic Data Management
Chapter 15 QUERY EXECUTION.
OrientX: an Integrated, Schema-Based Native XML Database System
Structure and Content Scoring for XML
Sungho Kang Yonsei University
Structure and Content Scoring for XML
Structural Joins: A Primitive for Efficient XML Query Pattern Matching
Relax and Adapt: Computing Top-k Matches to XPath Queries
Presentation transcript:

Holistic Twig Joins Optimal XML Pattern Matching Nicolas Bruno Columbia University Nick Koudas Divesh Srivastava AT&T Labs-Research SIGMOD 2002

2 XML Query Processing XML query languages are complex, with many features. Natural and pervasive operation: matching XML data with a tree structured pattern. Previous attempts decompose query into small pieces and solve them separately: complex optimization problem.

3 Data Model XML database: forest of rooted, ordered, labeled trees: Nodes represent elements or values. Edges model direct containment properties.

4 Query Model: Subset of XQuery FOR $b IN document(“books.xml”)//book $a IN $b//author WHERE contains($b/title, ‘XML’) AND $a/fn = ‘jane’ AND $a/ln = ‘doe’ RETURN $b/year Find the year of publication of all books about “XML” written by “Jane Doe”. Specific twig patterns can match relevant portions of the XML database.

5 Outline Problem formulation. PathStack: Path Queries. TwigStack: Twig Queries. XB-Trees: Sub-linear pattern matching. Experimental evaluation.

6 Twig Pattern Matching Exploit indexes over the XML document: document not needed in main memory. Given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D.

7 Indexing XML Documents Element positions represented as tuples (DocID, Left:Right, Level), sorted by Left. Child and descendant relationships between elements easily determined. Extension to classical IR inverted lists

8 Previous Attempts Based on binary joins [Zhang’01, Al-Khalifa’02]. Decompose query into binary relationships. Solve binary joins against XML database. Combine together “basic” matches. Main drawbacks: Optimization is required. Intermediate results can be large. - ((book title) XML) (year 2000) - (((book year) 2000) title) XML many other possibilities…

9 Our Approach: Holistic Joins Solve the entire twig query in two phases : 1- Produce “guaranteed” partial results using one pass. 2- Combine (merge join) partial results. Partial result smaller than final result. Exploit indexes. Skip irrelevant document fragments. Use containment relationships between query nodes.

10 Data Structures Each node q in query has associated: A stream T q, with the positions of the elements corresponding to node q, in increasing “left” order. A stack S q with a compact encoding of partial solutions (stacks are chained). XML fragmentQuery Matches Stacks

11 PathStack: Holistic Path Queries Repeatedly constructs stack encodings of partial solutions by iterating through the streams T q. Stacks encode the set of partial solutions from the current element in T q to the root of the XML tree. WHILE (!eof) qN = “getMin(q)” clean stacks push T qN ’s first element to S qN IF qN is a leaf node, expand solutions

12 PathStack Example Theorem: PathStack correctly returns all query matches with O( |input|+|output| ) I/O and CPU complexity.

13 Twig Queries Naïve adaptation of PathStack. Solve each root-to-leaf path independently. Merge-join each intermediate result. Problem: Many intermediate results might not be part of the final answer.

14 TwigStack 1) Compute only partial solutions that are guaranteed to extend to a final solution. 2) Merge partial solutions to obtain all matches. WHILE (!eof) qN = “getNext(q)” clean stacks IF T qN ’s first element is part of a solution, push it IF qN is a leaf node, expand solutions getNext might advance the streams in subTree(q) that are guaranteed not to be part of a solution

15 Analysis of TwigStack If getNext(q)=q N, then: Sub-tree q N has a solution using the stream heads. q N is “maximal”. getNext returns nodes in topological order. Stacks encode the set of partial solutions from the current element in getNext to the root of the XML tree. Theorem: TwigStack correctly returns all query matches with O(|input|+|output|) I/O and CPU complexity for ancestor/descendant relationships.

16 XB-Trees: A Variant of B-Trees Index positions of elements in the document. Allows adaptive granularity for consuming streams: advance and drillDown. TwigStack can be adapted to use XB-Trees with minimal changes.

17 Experimental Setting Implemented all algorithms in C++ using the file system as a simple storage engine. Synthetic and real databases. Unfolded DBLP database. X-Match + X-Mark benchmarks. Random XML documents. Techniques compared: Binary Join techniques. PathStack. TwigStack.

18 PathStack vs. Binary Joins XML database fragment: 1 million nodes. Path Query: A1//A2//A3//A4//A5//A6

19 PathStack vs. TwigStack

20 XB-Trees XML database fragment: 1 million nodes. Twig Query

21 Current and Future Work Handle arbitrary projections and constrained ancestor/descendant relationships optimally. Integrate TwigStack with value-based joins (id-refs, user defined predicates, etc.). Incorporate remaining axes (following, etc.).

22 Summary and Conclusions Developed holistic path join algorithms (PathStack and PathMPMJ) that are independent of size of intermediate results. Developed TwigStack, which generalizes PathStack for twig queries. Designed XB-Trees and integrated them to TwigStack.

23 Overflow Slides

24 PathMPMJ Non trivial adaptation of MPMGJN [Zhang’01]. Variant of merge-join that uses a stack of backtracking marks per query node.

25 PathStack vs. PathMPMJ XML database fragment: 1 million nodes.

26 TwigStack: Parent/Child edges Any algorithm that works over streams either gets deadlocked or results in suboptimal executions. QueryMatchesData (A1, B2, C2) (A2, B1, C1)

27 PathStack vs. PathMPMJ (2) DBLP database

28 PathStack vs. PathMPMJ (3) Benchmark database

29 PathStack vs. TwigStack (2) DBLP database

30 PathStack vs. TwigStack (3)

31 PathStack vs. TwigStack (4)

32 XB-Trees(2) DBLP database.

33 XB-Trees(3) Benchmark database.