
XML Data Management Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems November 25, 2008.


1 XML Data Management Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems November 25, 2008

2 Administrivia  For next time, please read & review the TurboXPath paper

3 XML: A Format of Many Uses  XML has become the standard for data interchange, and for many document representations  Sometimes we’d like to store it…  Collections of text documents, e.g., the Web, doc DBs  … How would we want to query those?  IR/text queries, path queries, XQueries?  Interchanging data  SOAP messages, RSS, XML streams  Perhaps subsets of data from RDBMSs  Storing native, database-like XML data  Caching  Logging of XML messages

4 XML: Hierarchical Data and Its Challenges  It’s not normalized…  It conceptually centers around some origin, meaning that navigation becomes central to querying and visualizing  Contrast with E-R diagrams  How to store the hierarchy?  Complex navigation may include going up, sideways in tree  Updates, locking  Optimization  Also, it’s ordered  May restrict order of evaluation (or at least presentation)  Makes updates more complex  Many of these issues aren’t unique to XML  Semistructured databases, esp. with ordered collections, were similar  But our efforts in that area basically failed…

5 Two Ways of Thinking of XML Processing  XML databases (today)  Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …)  Query optimization  “Streaming XML” (next time)  RDBMS → XML export  Partitioning of computation between source and mediator  “Streaming XPath” engines  The difference is in storage (or lack thereof)

6 XML in a Database  Use a legacy RDBMS  Shredding [Shanmugasundaram+99] and many others  Path-based encodings [Cooper+01]  Region-based encodings [Bruno+02][Chen+04]  Order preservation in updates [Tatarinov+02], …  What’s novel here? How does this relate to materialized views and warehousing?  Native XML databases  Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …)  Updates and locking  Query optimization (e.g., that on Galax)
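As a rough illustration of the region-based encodings mentioned on this slide, the sketch below labels each element with a (start, end, level) interval from one depth-first pass, so that ancestor/descendant tests reduce to interval containment. The function names `region_encode` and `is_ancestor` and the sample document are assumptions made for this example; the cited papers use their own numbering details.

```python
import xml.etree.ElementTree as ET

def region_encode(root):
    """Label every element with (tag, start, end, level) in one depth-first pass."""
    rows = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1                 # position where the element opens
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1                 # position where the element closes
        rows.append((elem.tag, start, counter, level))

    visit(root, 0)
    return sorted(rows, key=lambda r: r[1])   # document order = order of start

def is_ancestor(a, d):
    """a is an ancestor of d iff d's interval lies strictly inside a's."""
    return a[1] < d[1] and d[2] < a[2]

doc = ET.fromstring("<book><title/><chapter><section/></chapter></book>")
rows = region_encode(doc)
for row in rows:
    print(row)
```

With such labels stored in an ordinary table, a descendant step becomes a range predicate on `start` and `end` rather than a chain of parent-child joins.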

7 Query Processing for XML  Why is optimization harder?  Hierarchy means many more joins (conceptually)  “traverse”, “tree-match”, “x-scan”, “unnest”, “path”, … op  Though typically parent-child relationships  Often don’t have good measure of “fan-out”  More ways of optimizing this  Order preservation limits processing in many ways  Nested content ~ left outer join  Except that we need to cluster a collection with the parent  Relationship with NF² approach  Tags (don’t really add much complexity except in trying to encode efficiently)  Complex functions and recursion  Few real DB systems implement these fully  Why is storage harder?  That’s the focus of Natix, really

8 The Natix System  In contrast to many pieces of work on XML, focuses on the bottom layers, equivalent to System R’s RSS  Physical layout  Indexing  Locking/concurrency control  Logging/recovery

9 Physical Layout  What are our options in storing XML trees?  At some level, it’s all smoke-and-mirrors  Need to map to “flat” byte sequences on disk  But several options:  Shred completely, as in many RDBMS mappings  Each path may get its own contiguous set of pages  e.g., vectorized XML [Buneman et al.]  An element may get its 1:1 children  e.g., shared inlining [Shanmugasundaram+] and [Chen+]  All content may be in one table  e.g., [Florescu/Kossmann] and most interval encoded XML  We may embed a few items on the same page and “overflow” the rest  How collections are often stored in ORDBMS  We may try to cluster XML trees on the same page, as “interpreted BLOBs”  This is Natix’s approach (and also IBM’s DB2)  Pros and cons of these approaches?
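To make the "all content in one table" option concrete, here is a minimal sketch that shreds a small document into a single edge table and answers the path /lib/book/title with one self-join per step. The table name `edge`, its columns, and the sample document are assumptions for illustration and are not the schema of any cited system.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(root):
    """Flatten a tree into (id, parent, ord, tag, text) rows, one per element."""
    rows = []
    next_id = 0

    def visit(elem, parent_id, ordinal):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        rows.append((node_id, parent_id, ordinal, elem.tag, (elem.text or "").strip()))
        for i, child in enumerate(elem):
            visit(child, node_id, i)   # ord keeps sibling order

    visit(root, None, 0)
    return rows

doc = ET.fromstring(
    "<lib><book><title>A</title></book><book><title>B</title></book></lib>"
)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edge (id INTEGER PRIMARY KEY, parent INTEGER, "
    "ord INTEGER, tag TEXT, text TEXT)"
)
conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?, ?)", shred(doc))

# /lib/book/title: each path step costs one self-join on the edge table.
titles = [row[0] for row in conn.execute(
    """
    SELECT t.text
    FROM edge l
    JOIN edge b ON b.parent = l.id
    JOIN edge t ON t.parent = b.id
    WHERE l.parent IS NULL AND l.tag = 'lib'
      AND b.tag = 'book' AND t.tag = 'title'
    ORDER BY b.ord, t.ord
    """
)]
print(titles)
```

The explicit `ord` column and the `ORDER BY` are what preserve document order here; a relational engine gives no ordering guarantee otherwise.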

10 Challenges of the Page-per-Tree Approach  How big of a tree?  What happens if the XML overflows the tree?  Natix claims an adaptive approach to choosing the tree’s granularity  Primarily based on balancing the tree, constraints on children that must appear with a parent  What other possibilities make sense?  Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages
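The idea of splitting a tree across pages can be sketched with a toy greedy policy: when a subtree exceeds the page capacity, move its largest child subtree to a new page and leave a small proxy in its place. This is not Natix's actual algorithm, which balances the split and honors constraints on which children stay with a parent; the names `Node`, `Proxy`, `place` and the slot-based sizes are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    size: int = 1                          # hypothetical storage cost in slots
    children: list = field(default_factory=list)

@dataclass
class Proxy:
    page_id: int                           # page that holds the moved subtree
    size: int = 1

def subtree_size(n):
    if isinstance(n, Proxy):
        return n.size
    return n.size + sum(subtree_size(c) for c in n.children)

def place(root, capacity, pages):
    """Store root's subtree on a new page, pushing child subtrees out until it fits."""
    page_id = len(pages)
    pages.append(None)                     # reserve the slot for this page
    while subtree_size(root) > capacity:
        # Only subtrees larger than a proxy are worth moving.
        candidates = [
            (subtree_size(c), i)
            for i, c in enumerate(root.children)
            if isinstance(c, Node) and subtree_size(c) > 1
        ]
        if not candidates:
            raise ValueError("node does not fit on a page")
        _, i = max(candidates)
        # The proxy stays at the child's position, so sibling order is kept.
        root.children[i] = Proxy(place(root.children[i], capacity, pages))
    pages[page_id] = root
    return page_id

b = Node("b", children=[Node("b1"), Node("b2"), Node("b3")])
d = Node("d", children=[Node("d1"), Node("d2")])
a = Node("a", children=[b, Node("c"), d])

pages = []
place(a, capacity=6, pages=pages)
print(len(pages), [subtree_size(p) for p in pages])
```

In this run the nine-slot tree ends up on two pages, with a proxy for `b` left in the root page.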

11 Example  Split point in parent page  Note “proxy” nodes

12 That Was Simple – But What about Updates?  Clearly, insertions and deletions can affect things  Deletion may ultimately require us to rebalance  Ditto with insertion  But insertion also may make us run out of space – what to do?  Their approach: add another page; ultimately may need to split at multiple levels, as in B+ Tree  Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order

13 Does this Help?  According to general lore, yes  The Natix experiments in this paper were limited in their query and adaptivity loads  But the IBM people say their approach, which is similar, works significantly better than Oracle’s shredded approach

14 There’s More to Updates than the Pages  What about concurrency control and recovery?  We already have a notion of hierarchical locks, but they claim:  If we want to support IDREF traversal, and indexing directly to nodes, we need more  What’s the idea behind SPP locking?

15 Logging  They claim ARIES needs some modifications – why?  Their changes:  Need to make subtree updates more efficient – don’t want to write a log entry for each subtree insertion  Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to WAL  “Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree  A few minor tweaks to minimize undo/redo when only one transaction touches a page
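The annihilator idea can be sketched as a filter over an aborting transaction's log: if the transaction created a tree, undoing that creation makes the undo of its later changes to the same tree unnecessary. The record format `(op, tree_id)` and the function `plan_undo` are hypothetical simplifications, not the log layout Natix uses.

```python
def plan_undo(log):
    """Build the undo steps for one aborting transaction.

    log: list of (op, tree_id) records in forward (oldest first) order.
    """
    # Trees this transaction created; undoing the creation annihilates
    # every other change the transaction made to them.
    created = {tree for op, tree in log if op == "create_tree"}
    plan = []
    for op, tree in reversed(log):          # undo runs newest first
        if tree in created and op != "create_tree":
            continue                        # annihilated, nothing to undo
        plan.append(("undo", op, tree))
    return plan

log = [
    ("create_tree", "T1"),
    ("insert_node", "T1"),
    ("insert_node", "T2"),
    ("delete_node", "T1"),
]
for step in plan_undo(log):
    print(step)
```

Only two of the four records need undo work here: the insert into the pre-existing tree `T2` and the creation of `T1`.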

16 Annihilators

17 Assessment  Native XML storage isn’t really all that different from other means of storage  There are probably some good reasons to make a few tweaks in locking  Optimization remains harder  A real solution to materialized view creation would probably make RDBMSs come close to delivering the same performance, modulo locking

18 Next Time: “Streaming XML”  An XQuery consists of a series of XPath expressions in the FOR/LET clauses, plus a WHERE condition and a RETURN constructor  The FOR/LET clauses create bindings between variables and nodes (or node sets)  We can consider a set of bindings to be a tuple  So: can we build an XPath matcher that processes XML across the network, and produces tuple streams?
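The view of bindings as tuples can be sketched as a generator: each combination of FOR-variable bindings becomes one tuple, which WHERE then filters and RETURN consumes. The example runs over an in-memory document rather than a network stream, and the query, the variable names `b` and `a`, and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def bindings(root):
    """FOR $b IN /lib/book, $a IN $b/author: yield one tuple per ($b, $a) pair."""
    for b in root.findall("book"):
        for a in b.findall("author"):
            yield {"b": b, "a": a}

doc = ET.fromstring(
    "<lib>"
    '<book year="1999"><title>X</title><author>Ann</author></book>'
    '<book year="2005"><title>Y</title><author>Bob</author><author>Cy</author></book>'
    "</lib>"
)

# WHERE $b/@year > 2000  RETURN ($b/title, $a)
result = [
    (t["b"].findtext("title"), t["a"].text)
    for t in bindings(doc)
    if int(t["b"].get("year")) > 2000
]
print(result)
```

Because `bindings` is a generator, downstream operators pull tuples one at a time, which is the shape a streaming matcher would need to produce.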
