XML Query Processing Yaw-Huei Chen
Department of Computer Science and Information Engineering National Chiayi University

Outline
  Introduction to XML
  Query Languages
  Indexing
  Query Processing
  Incremental Cache Maintenance
  Testing Reachability
  Conclusions
5/12/2006

From Documents to Data
HTML describes presentation
<h1>References</h1>
<p>S. Abiteboul, P. Buneman, D. Suciu, <i>Data On The Web</i>, 2000.</p>

From Documents to Data (cont.)
XML (eXtensible Markup Language) describes content
<references>
  <book>
    <author>S. Abiteboul</author>
    <author>P. Buneman</author>
    <author>D. Suciu</author>
    <title>Data On The Web</title>
    <year>2000</year>
  </book>
</references>

XML Syntax
Element
  a piece of text bounded by matching tags
  elements can be nested
  <author>D. Suciu</author>
Attribute
  unordered, each associated with an element node, has a name and a value
  an alternative way to represent data
  <book price="50" currency="USD">… </book>
XML document has a single root element
Well-formed XML documents
  tags must nest properly
  attribute names must be unique within an element
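The well-formedness rules can be tried out by handing a few strings to a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, one per rule listed on the slide.
samples = {
    "nested properly": "<book><author>D. Suciu</author></book>",
    "tags cross": "<book><author>D. Suciu</book></author>",
    "duplicate attribute": '<book price="50" price="60"/>',
    "two root elements": "<book/><book/>",
}
for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample is accepted; the other three each break one of the rules above.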

XML Hierarchical Data Model
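[Figure: tree with root references and child book; book has children author (S. Abiteboul), author (P. Buneman), author (D. Suciu), title (Data on the Web), and year (2000); further siblings are elided (…).]
XML is ordered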

Specifying the Structure
DTD (Document Type Definition): A context-free grammar
<!DOCTYPE references [
  <!ELEMENT references (book+)>
  <!ELEMENT book (author*, title, year?)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
]>
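Each content model in this DTD is a regular expression over child element names, so a rough check can be sketched with Python's `re` module. This is only an illustration under that simplification, not a DTD validator; `CONTENT_MODELS`, `PCDATA_ONLY`, and `follows_dtd` are hypothetical names, and the regular expressions are hand-translated from the declarations above.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: each child tag is written as "name ".
CONTENT_MODELS = {
    "references": r"(book )+",                # (book+)
    "book": r"(author )*title (year )?",      # (author*, title, year?)
}
PCDATA_ONLY = {"author", "title", "year"}     # declared as (#PCDATA)


def follows_dtd(elem):
    """Recursively check each element's child sequence against its model."""
    if elem.tag in PCDATA_ONLY:
        return len(elem) == 0                 # text only, no child elements
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False                          # element not declared
    children = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(follows_dtd(child) for child in elem)


valid = ET.fromstring(
    "<references><book><author>S. Abiteboul</author>"
    "<title>Data On The Web</title><year>2000</year></book></references>"
)
# Same content, but title precedes author, which the model does not allow.
invalid = ET.fromstring(
    "<references><book><title>Data On The Web</title>"
    "<author>S. Abiteboul</author></book></references>"
)
print(follows_dtd(valid), follows_dtd(invalid))
```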

Specifying the Structure (cont.)
XML Schema
  in XML format
  element names and types associated locally
  includes primitive data types
  a superset of DTDs
Valid XML documents
  the document must be well-formed
  the element names must follow the structure specified in a DTD file or an XML schema file

Storing XML Documents
Designing a specialized system for storing native XML data
Using a DBMS to store whole XML documents as text fields
Using a DBMS to store the document contents as data elements
  It must support XML's ordered data model

XPath
Using path expressions to select nodes or node-sets
  Single slash (/): a direct child
  Double slash (//): a descendant at any level
/references selects the root element references
//book selects all book elements
references//book selects all book elements that are descendants of the references element
/references/* selects all the child elements of the references element
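These path expressions can be tried with Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset. In this sketch the paths are written relative to the root element, because ElementTree does not accept absolute paths on an element; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<references>
  <book>
    <author>S. Abiteboul</author>
    <author>P. Buneman</author>
    <author>D. Suciu</author>
    <title>Data On The Web</title>
    <year>2000</year>
  </book>
</references>
""")

# `doc` is the root element <references>, so each path starts from it.
children = doc.findall("./*")            # counterpart of /references/*
books = doc.findall(".//book")           # counterpart of references//book
authors = doc.findall("./book/author")   # single slashes: direct children only

print([e.tag for e in children])
print(len(books))
print([a.text for a in authors])
```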

XQuery
XQuery uses XPath expressions, but has additional constructs.
FLWR stands for the four main clauses of XQuery:
  FOR <variable bindings to individual nodes (elements)>
  LET <variable bindings to collections of nodes (elements)>
  WHERE <qualifier conditions>
  RETURN <query result specification>
For example:
  for $b in doc("references.xml")//book
  where count($b/author) > 0
  return <book> { $b/title } { for $a in $b/author return $a } </book>
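To make the clause-by-clause reading concrete, here is a rough Python analogue of the example query, using the standard `xml.etree.ElementTree`. It is not XQuery and only mirrors the control flow; the in-memory document stands in for references.xml, and its second book (without authors) is hypothetical data added so the WHERE filter has something to reject.

```python
import xml.etree.ElementTree as ET

source = ET.fromstring(
    "<references>"
    "<book><author>S. Abiteboul</author><author>P. Buneman</author>"
    "<title>Data On The Web</title><year>2000</year></book>"
    "<book><title>Untitled Notes</title></book>"   # hypothetical, no authors
    "</references>"
)

results = []
for b in source.iter("book"):              # for $b in ...//book
    authors = b.findall("author")          # $b/author
    if len(authors) > 0:                   # where count($b/author) > 0
        out = ET.Element("book")           # return <book> ... </book>
        out.extend(b.findall("title"))     #   { $b/title }
        out.extend(authors)                #   { for $a in $b/author return $a }
        results.append(out)

for r in results:
    print(ET.tostring(r, encoding="unicode"))
```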

Indexing
Efficient mechanisms are needed for
  Determining the ancestor-descendant relationship between XML elements
Two types of indexes can help
  Structural index: It can reduce the time for traversing the XML hierarchy.
  Numbering scheme: It encodes each element by its positional information within the XML hierarchy.

Structural Index
DataGuides [Goldman97]:
  Every label path of the source graph has exactly one data path instance in its DataGuide.
[Figure: a source graph and its DataGuide, with nodes labeled A, B, C, D.]

Structural Index (cont.)
1-Index [Milo99]:
  Grouping together nodes if they have the same set of incoming paths
[Figure: a data graph, its 1-index, and its DataGuide, with nodes labeled A, B, C, D.]

Structural Index (cont.)
Covering indexes [Kaushik02]
Forward and Backward Index (F&B-Index)
  Add inverse edges to the graph
  Compute the 1-index (or DataGuide) for the modified graph
The size of F&B-Index is too large. To reduce the size:
  only useful tags are indexed
  do not index all idref edges (XPath gives a higher priority to tree edges and // matches only tree edges)
  exploit local similarity (short paths only)
  restrict tree depth

Numbering Scheme
Dewey Decimal Coding [Tatarinov02]
[Figure: tree labeled with Dewey codes: references (1); book (1.1) with children author (1.1.1), author (1.1.2), title (1.1.3), year (1.1.4); book (1.2) with children author (1.2.1), title (1.2.2).]

Numbering Scheme (cont.)
Inserting new elements
[Figure: the references tree after a new element is inserted under book 1.1, whose children now carry codes 1.1.1 to 1.1.5; the legend marks the new element and the nodes that require renumbering.]
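With Dewey codes, an ancestor test is a prefix test, and an insertion shifts the codes of later siblings and their descendants. The sketch below illustrates both; the function names are hypothetical, and the insertion position (third child of 1.1) is an assumption, since the slide does not state where the new element goes.

```python
def parse(code):
    """'1.1.3' -> (1, 1, 3)"""
    return tuple(int(part) for part in code.split("."))


def is_ancestor(x, y):
    """x is an ancestor of y iff x's code is a proper prefix of y's code."""
    px, py = parse(x), parse(y)
    return len(px) < len(py) and py[:len(px)] == px


def insert_child(codes, parent, position):
    """Insert a new child of `parent` at 1-based `position`.

    Returns the new node's code and a dict old_code -> new_code for every
    existing node that has to be renumbered.
    """
    p = parse(parent)
    depth = len(p)
    renumbered = {}
    for code in codes:
        c = parse(code)
        # Later siblings of the new node, and everything below them, shift by 1.
        if len(c) > depth and c[:depth] == p and c[depth] >= position:
            shifted = c[:depth] + (c[depth] + 1,) + c[depth + 1:]
            renumbered[code] = ".".join(map(str, shifted))
    new_code = ".".join(map(str, p + (position,)))
    return new_code, renumbered


codes = ["1", "1.1", "1.1.1", "1.1.2", "1.1.3", "1.1.4", "1.2", "1.2.1", "1.2.2"]
print(is_ancestor("1.1", "1.1.3"), is_ancestor("1.1", "1.2.1"))
print(insert_child(codes, "1.1", 3))
```

Nodes under 1.2 are untouched, which is why the renumbering cost depends on where in the sibling order the insertion happens.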

Preorder and postorder [Dietz82]
(preorder, postorder)
x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.
[Figure: tree labeled with (preorder, postorder) pairs: references (1,10); book (2,6) with children (3,1), (4,2), (5,3), (6,4), (7,5); book (8,9) with children (9,7), (10,8).]
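The labels in the figure can be reproduced with one depth-first traversal. The following sketch assumes the tree shape implied by the figure (a first book with three authors, a title, and a year; a second book with one author and a title); the nested-tuple representation and the function names are illustrative.

```python
from itertools import count

# (label, children) pairs; the shape is inferred from the figure's labels.
TREE = ("references", [
    ("book", [("author", []), ("author", []), ("author", []),
              ("title", []), ("year", [])]),
    ("book", [("author", []), ("title", [])]),
])


def number(tree):
    """Return (label, preorder, postorder) triples in preorder sequence."""
    pre_counter, post_counter = count(1), count(1)
    out = []

    def visit(node):
        label, children = node
        entry = [label, next(pre_counter), None]   # preorder: on entry
        out.append(entry)
        for child in children:
            visit(child)
        entry[2] = next(post_counter)              # postorder: on exit

    visit(tree)
    return [tuple(e) for e in out]


def is_ancestor(x, y):
    """x precedes y in preorder and follows y in postorder."""
    return x[1] < y[1] and x[2] > y[2]


nodes = number(TREE)
for node in nodes:
    print(node)
print(is_ancestor(nodes[1], nodes[5]), is_ancestor(nodes[1], nodes[8]))
```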

Various interval schemes
(docno, begin:end, level) [Zhang01]
  The begin and end positions can be generated by doing a depth-first traversal of the tree and sequentially assigning a number at each visit.
(preorder, size) [Li01]
  Size is an arbitrary integer larger than the total number of the current descendants.
(lowest_post, postorder) [Agrawal89]
  Lowest_post is the lowest postorder number of its descendants.
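As a minimal sketch of the (begin:end, level) idea, the code below numbers a small tree during a depth-first traversal and tests containment. It assumes a single document, so docno is left out, and it counts only element visits; the tree and the function names are illustrative.

```python
def label_intervals(tree):
    """Assign (name, begin, end, level): one number on entry, one on exit."""
    labels = []
    position = 0

    def visit(node, level):
        nonlocal position
        name, children = node
        position += 1
        begin = position
        index = len(labels)
        labels.append(None)                 # reserve the slot in preorder
        for child in children:
            visit(child, level + 1)
        position += 1
        labels[index] = (name, begin, position, level)

    visit(tree, 0)
    return labels


def is_ancestor(x, y):
    """x's interval strictly contains y's interval."""
    return x[1] < y[1] and y[2] < x[2]


def is_parent(x, y):
    """Ancestor that is exactly one level above."""
    return is_ancestor(x, y) and y[3] == x[3] + 1


tree = ("references", [("book", [("author", []), ("title", [])])])
labels = label_intervals(tree)
print(labels)
```

The level component is what distinguishes a parent-child step (/) from an ancestor-descendant step (//) without walking the tree.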

Query Processing
To find all occurrences of a query pattern in the XML documents.
Navigation-based approach
  Analyzing the input document one tag at a time.
  The query is represented as a non-deterministic finite automaton (NFA) [Diao03]
Index-based approach
  Using pre-computed indexes to answer the query

Holistic Twig Join [Bruno02]
Indexes
  string: (doc, left, level)
  element: (doc, left:right, level)
Query: A//B//C
[Figure: data containing elements A1, B1, A2, B2, C1; stacks SA (A1, A2), SB (B1, B2), SC (C1) as the stack encoding; query results A1 B1 C1, A1 B2 C1, A2 B2 C1.]
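For reference, the result set that a holistic join must produce for A//B//C can be computed by brute force over the element intervals. The sketch below is a naive containment join, not the stack-based algorithm of [Bruno02]; the (left, right) positions are hypothetical values chosen so that the elements nest as A1, B1, A2, B2, C1.

```python
from itertools import product

# Hypothetical (name, left, right) entries for a nested chain
# A1 > B1 > A2 > B2 > C1; one stream per tag, sorted by left position.
STREAMS = {
    "A": [("A1", 1, 10), ("A2", 3, 8)],
    "B": [("B1", 2, 9), ("B2", 4, 7)],
    "C": [("C1", 5, 6)],
}


def contains(outer, inner):
    """outer is an ancestor of inner iff its interval encloses inner's."""
    return outer[1] < inner[1] and inner[2] < outer[2]


def path_matches(tags):
    """All matches of tags[0]//tags[1]//... by checking every combination."""
    results = []
    for combo in product(*(STREAMS[t] for t in tags)):
        if all(contains(combo[i], combo[i + 1]) for i in range(len(combo) - 1)):
            results.append(tuple(e[0] for e in combo))
    return results


print(path_matches(["A", "B", "C"]))
```

This enumerates every combination of stream entries, which is exactly the cost the stack encoding avoids by representing the matches compactly.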

Stream Processing (Path)
[Figure: (a) XPath Query: A//B//C; (b) XML Data; (c) Query Stack Structure, with stacks A, B, C linked by start and end pointers; (d) Query Result: A1B1C1, A1B1C2, A1B4C4.]

TWIG Algorithm
Query structures
  Stacks: representing query elements
  Pointers: links between stacks
    Start pointer, End pointer, Next pointer
  All results are stored in the stacks
Advantages
  Can process twig query
  No join problem

Stream Processing (Twig)
[Figure: (a) Twig Query, XPath: A[//B[//C]/D]//B; (b) XML Data; (c) Stack Structure of Twig Query, with start, end, and next pointers; (d) Twig Query Result: A1B1C1D2B3, A1B1C1D2B4.]

XML Cache Maintenance
Benefits of caching data
  Improving query performance
  Reducing load on databases

Update Maintenance
[Figure: timeline of messages between the proxy and the source database. Initially: (1) the source query, (2) query result, stored as Cache0. On each update (Data1 ... Datan): (1) update path information, (2) the source query (optional), (3) query result (optional), producing Cache1 ... Cachen.]

Contributions
Incremental maintenance of XML cache
  Number of source queries: 0
  Processing both path and twig queries
Improving query performance
Reducing cache size

XML Source Updates
Two primitive operations
  Insertion
    Insert a leaf node into an XML document
  Deletion
    Delete a leaf node from an XML document
Limitations
  One operation at a time
  The XML document should be indexed

Incremental Cache Maintenance
Two phases
  Differences discovery
  Data integration
Query types supported
  Path query
  Twig query

Path Query - Insertion
[Figure: (a) XML Data, nodes labeled Name: Index; (b) XPath Query: A//B//B//C; (c) Update Path: A/B/B/B/E/C, Update Index Path: 1/2/8/9/10/20; (d) Cache0; (e) Temp Structure; (f) Cache1. Stacks A, B, C are linked by start and end points.]

Path Query - Deletion
[Figure: (a) XML Data, nodes labeled Name: Index; (b) XPath Query: A//B//B//C; (c) Update Path: A/B/B/B/E/C, Update Index Path: 1/2/8/9/10/20; (d) Cache1; (e) Temp Structure; (f) Cache2. Stacks A, B, C are linked by start and end points.]

Twig Query - Insertion
[Figure: (a) XML Data, nodes labeled Name: Index; (b) XPath Query: A[//B[//C]//D]//E; (c) Update Path: A/B/D/D, Update Index Path: 1/2/7/12; (d) Source Query Structure, Cache0, and Cache1; (e) Temp Structure. Stacks are linked by start, end, and next points.]

Experiments
Language
  Java: Borland JBuilder 9 Enterprise
Data sets
  Real data: 4.41MB
  Synthetic data: 860KB
Query time
  Incremental maintenance: XCM algorithm
  Full re-computation

Experiments
Average Time
[Table: columns Query, XCM Algorithm (s), Full re-computation (s). Synthetic data set queries: A/B/D; A//B//C/D; A//B//B//C//E. Real data set queries: issues//articles//authors; issuesTuple//articlesTuples/title. Surviving value: 7.6941.]

Reachability
Many problems are modeled as DAGs
Common requirement
  Efficient reachability testing

Related Work
Simple way
  Transitive closure matrix
Bit-vector encoding schemes
Numerical interval-based approach
Other approaches

The Concept
We propose a new encoding scheme
  find a spanning tree for the graph
  label the spanning tree with numerical intervals
  generate a transitive closure matrix for the tails and heads of the non-tree edges
  test reachability in constant time
  support updates
Moreover, our approach is space-efficient: $O(|V_{nt}|^2)$ space cost, not $O(|V|^2)$
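The split between tree edges and non-tree edges can be sketched in a few lines: a node reaches everything inside its own interval, plus whatever is reachable from the head of a non-tree edge whose tail lies inside that interval. The sketch below is a simplified illustration, not the proposed scheme: it recurses over non-tree edges with memoization instead of consulting a precomputed matrix, so it does not give constant-time tests. The intervals follow the deck's example tree; the two non-tree edges are hypothetical.

```python
from functools import lru_cache

# Spanning-tree intervals (begin, end) for nodes a..i.
INTERVAL = {
    "a": (1, 18), "b": (2, 7), "c": (3, 6), "d": (4, 5), "e": (8, 17),
    "f": (9, 12), "g": (10, 11), "h": (13, 14), "i": (15, 16),
}
# Hypothetical non-tree edges as (tail, head) pairs; the graph stays acyclic.
NON_TREE_EDGES = [("h", "c"), ("i", "g")]


def in_subtree(u, v):
    """True if v lies in the spanning subtree rooted at u (u included)."""
    (ub, ue), (vb, ve) = INTERVAL[u], INTERVAL[v]
    return ub <= vb and ve <= ue


@lru_cache(maxsize=None)
def reachable(u, v):
    """u reaches v via tree edges alone, or via a non-tree edge below u."""
    if in_subtree(u, v):
        return True
    return any(in_subtree(u, tail) and reachable(head, v)
               for tail, head in NON_TREE_EDGES)


print(reachable("h", "d"), reachable("e", "d"), reachable("b", "g"))
```

Precomputing the answers for every (tail, head) pair is what turns this recursion into a table lookup, and the table only needs rows and columns for nodes that touch a non-tree edge.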

Transitive Closure Matrix $M_{nt}$
[Figure: example graph with tree edges and non-tree edges, its spanning tree labeled with intervals: a (1,18); b (2,7); c (3,6); d (4,5); e (8,17); f (9,12); g (10,11); h (13,14); i (15,16). Beside it, the transitive closure matrix $M_{nt}$ with head nodes b, c, g and tail nodes a, f, h, i.]

[Figure: the same labeled tree with tail nodes (T) and head nodes (H) marked, above a number line from 1 to 18. Tail node ranges: (1,18), (9,12), (13,14), (15,16) for a, f, h, i. Head node ranges: (2,7), (3,6), (10,11) for b, c, g. Legend: tail begin (+), tail end (-), head begin, head end.]

[Figure: the example graph with tree and non-tree edges, the tail node ranges (1,18), (9,12), (13,14), (15,16) marked with begin (+) and end (-) on a number line, and the tail ranges of head nodes b, c, and g in $M_{nt}$; the transitive closure matrix $M_{nt}$ has head nodes b, c, g and tail nodes a, f, h, i.]

Analysis
Label the spanning tree with numerical intervals
  deciding the root takes $O(|E| + |V|)$ time
  generating the spanning tree takes $O(|V|)$ time and requires $O(|V|)$ space
Find the representing tail and head ranges
  $O(|V_{nt}| \log |V_{nt}|)$ time and $O(|V_{nt}|)$ space
Generate the transitive closure matrix $M_{nt}$
  $O(|V_{nt}|^3)$ time and $O(|V_{nt}|^2)$ space
Update

Conclusions
A new encoding scheme to record the transitive closure information using nested numerical intervals
  $O(|V_{nt}|^3 + |V|)$ time and $O(|V_{nt}|^2 + |V|)$ space complexity
Testing reachability in $O(1)$ time
Updating locally by adjusting the encoding information

Future Work
  Version management
  Materialized views
  Cache management
  Aggregate query processing
  Streaming data processing
  ...
