Native XML Databases Lior Schejter Erez Hadad. Outline Introduction: What? Why? How? An example XML DBMS: Natix Query processing  Natix query engine.

Native XML Databases Lior Schejter Erez Hadad

Outline
- Introduction: What? Why? How?
- An example XML DBMS: Natix
- Query processing
  - Natix query engine
- XML DBMS Storage
  - What is stored in an XML DBMS?
  - Storage Management: Natix Storage
  - Storing documents and indexes
- Transaction Management
  - Logging & Recovery
  - Locking

Native XML DBMS: What?
- A DBMS native to XML
  - Provides programming interfaces to manage and query XML data (one or more documents)
  - Uses a full-blown, consistent API
- XML is the natural way of accessing data in this kind of DBMS
  - Defining, querying, etc.
- Provides all the familiar features of a DBMS
  - Transactions
  - Recovery
  - Multithreading

Native XML Databases: Why?
- XML structure provides information
  - Complex data model: ever tried representing an organizational hierarchy in a DBMS?
- Flexible data model
  - Both structured and semi-structured data
- Inherent XML "behavior"
  - The DBMS is optimized for XML
  - Think about querying, storing, and updating thousands of XML documents

Native XML Databases: How?
- XML can be thought of as:
  - A text document (tags, simple text, etc.)
  - A data model (nodes, children, siblings)
- Document-centric: storing and retrieving the entire document (or large parts of it)
  - Fast document construction and storage
  - Slow queries and data retrieval
- Data-centric: expressing the entire data in an internal data structure
  - Fast queries
  - Slow document retrieval
- A question of granularity

Natix
- Natix: an XML DBMS
  - Developed at the University of Mannheim, Germany
- Designed from scratch for storing and accessing XML data
- Supports XPath and XQuery
- Not designed for any specific language and/or environment

Natix - Architecture

- Storage Layer: manages all persistent data storage
- Service Layer: provides DBMS functionality
- Binding Layer: modules that map data and requests from other APIs to the Natix engine interface

Natix Query Execution Engine
- Able to execute all queries in a typical XML query language (e.g. XQuery)
- Expressive: a small number of powerful, parameterized operators
- Two main components:
  - Natix Physical Algebra (NPA): algebraic operators, their composition, etc.
  - Natix Virtual Machine (NVM): the programs used inside the algebraic operators

NQE: The Rough Guide
- NPA works on sequences of tuples
- Each tuple holds values, each of which can be a number, a string, or a node handle
  - XML node handles can point to any node type
- NPA operators are implemented as iterators
- NPA operators take programs for the NVM as parameters
  - Usually passed at construction

Natix Virtual Machine
- NVM commands operate on register sets
- An NVM command may access several register sets
  - Global (X) register set for data that is global to the current execution plan
  - Z and Y register sets for arguments (passed between operators)
  - Y register sets are used only for binary operators
- A reference to the register sets is passed between operators to avoid unnecessary copying

NVM program examples

  NVM commands                    Meaning
  CMP_LEQ_SI4_ZCX 1 55 2          if (Z1 > 55) then exit
  EXIT_F 2

  ARITH_ADD_A_SI4_ZZX 1 2 3       X3 = Z1 + Z2
  PRINT_SI4 3                     PRINT X3
  STOP
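The two example programs can be traced with a toy interpreter. This is only a sketch: the operand conventions (the letters at the end of a mnemonic naming the register set of each operand, with `C` standing for a constant) are inferred from the slide, and the function `run` and the tuple encoding of commands are hypothetical, not the actual NVM.

```python
def run(program, z, x=None):
    """Run a list of (opcode, *operands) over register sets Z and X.

    Returns (printed_values, finished); finished is False when EXIT_F fired.
    """
    x = {} if x is None else x
    out = []
    for op, *args in program:
        if op == "CMP_LEQ_SI4_ZCX":        # X[dst] = (Z[src] <= constant)
            src, const, dst = args
            x[dst] = z[src] <= const
        elif op == "EXIT_F":               # leave the program if X[reg] is false
            if not x[args[0]]:
                return out, False
        elif op == "ARITH_ADD_A_SI4_ZZX":  # X[dst] = Z[a] + Z[b]
            a, b, dst = args
            x[dst] = z[a] + z[b]
        elif op == "PRINT_SI4":            # "print" X[reg] by collecting it
            out.append(x[args[0]])
        elif op == "STOP":
            break
        else:
            raise ValueError(f"unknown opcode {op}")
    return out, True


# The two snippets from the slide, encoded as tuples (assumed encoding).
GUARD = [("CMP_LEQ_SI4_ZCX", 1, 55, 2), ("EXIT_F", 2)]
ADD_AND_PRINT = [("ARITH_ADD_A_SI4_ZZX", 1, 2, 3), ("PRINT_SI4", 3), ("STOP",)]
```

With `Z1 = 60` the guard exits early; with `Z1 = 55` it runs to the end, which matches the reading "if (Z1 > 55) then exit".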

XML NVM Commands
- The NVM has about 150 XML-specific commands
  - Copying document handles, comparing fragments, traversing the XML document tree, printing, etc.
- Mainly commands that correspond to XPath axes (child, sibling, descendant, etc.)

Example: UnnestMap operator
- Logically: takes a set-valued expression and returns a single tuple for each element in the result set, flattening the hierarchy one level deep
- Physically: takes three programs and uses them as an iterator:
  - init: initializes the first tuple to be returned
  - step: computes the next tuple
  - fin: finalizes and cleans up

UnnestMap Operator

init:
  XML_CHILD_ZZ 1 2
  XML_VALID_ZX 2 3
  EXIT_F 3
  MV_XML_ZX 2 4
step:
  XML_SIBLING_NEXT_XX 4 4
  XML_VALID_XX 4 3
  EXIT_F 3
  MV_XML_XZ 4 2
fin:
  (no finish program)
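The init / step / fin protocol can be sketched as a Python iterator. Plain callables stand in for the NVM programs here, and the class interface, the `children_of` helper, and the use of `None` as the "no more tuples" signal are assumptions made for illustration; the sketch walks the children of a node much as the child and next-sibling commands do.

```python
import xml.etree.ElementTree as ET


class UnnestMap:
    """Iterator driven by three callables, mirroring init / step / fin."""

    def __init__(self, init, step, fin=None):
        self.init, self.step, self.fin = init, step, fin

    def __iter__(self):
        state = self.init()              # position on the first result
        try:
            while state is not None:     # None plays the role of EXIT_F
                yield state["current"]
                state = self.step(state)
        finally:
            if self.fin is not None:     # optional cleanup program
                self.fin()


def children_of(parent):
    """Build init/step callables that walk the children of `parent`."""
    kids = list(parent)

    def init():
        return {"pos": 0, "current": kids[0]} if kids else None

    def step(state):
        pos = state["pos"] + 1
        return {"pos": pos, "current": kids[pos]} if pos < len(kids) else None

    return UnnestMap(init, step)         # no fin, as in the slide


root = ET.fromstring("<bib><conference/><journal/><conference/></bib>")
tags = [child.tag for child in children_of(root)]
```

Each child of `root` comes out as one result, which is the "one tuple per element, one level deep" behavior described for UnnestMap.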

Natix Physical Algebra
- Operators for selection and for combining bindings are borrowed from the relational and object database contexts
  - select, join, map, group, etc.
- The main concern: variable-binding and result-construction operators for XML

NPA: Query plans
- Every execution plan has a scan operation at the bottom, which scans a document and returns its root in a tuple
  - e.g. Expression Scan
- UnnestMap and PathScan are used for variable bindings as well
  - An XPath expression can be translated into a sequence of UnnestMap operations
  - Or into a single PathScan operation, which also eliminates duplicates, as in XPath

NPA Operators e.g. SELECT Any subplan

NPA Examples Example DTD:

NPA Example 1
Query:
  {
    FOR $c IN document("bib.xml")/bib/conference
    WHERE $c/year > 1996
    RETURN {$c/title} {$c/year}
  }

NPA Example 1 The query plan: Function Calls Tuples

NPA Example 2
Query:
  {
    FOR $a IN document("bib.xml")//conference/article/author
    RETURN
    {
      FOR $b IN document("bib.xml")//conference/article,
          $c IN $b/author
      WHERE $c/@first = $a/@first AND $c/@last = $a/@last
      RETURN {$b/title}
    }
  }

NPA Example 2 The query plan:

Part II: Storage & Transactions in Native XML DBMS

What Is Stored In A Native XML-DB?
- XML documents
  - The data itself
  - The DBMS tries to keep imported documents as close to their original form as possible
- Data definition schemas
  - XML Schemas, RELAX NG schemas, DTDs
  - Used for:
    - Validating documents
    - Organizing data on disk
    - Validating and optimizing queries, constructing result sets
    - Semantics: types, operations

What Is Stored In A Native XML-DB?
- Collections / roots:
  - Group XML documents into sets, according to type or relevance
  - A document may belong to more than one set
  - A collection may be associated with a schema
  - Collections are valid instances of the data model and can be processed through queries:
      for $d in collection("foo")
      where $d/Book/Author/Lastname = "Dante"
      return $d

What Is Stored In A Native XML-DB?
- "Standard DB" components:
  - Indexes
    - Speed up query execution
  - Stored functions / procedures / triggers
    - Embed business logic in storage
    - Server-local processing reduces network traffic
    - Create views (abstractions of data)
  - Access control data
    - Users, resources, groups, permissions

Storage Management
- Usually, it is impossible to hold all the DB components in main memory
  - Main memory is several orders of magnitude smaller than secondary storage (disk)
- A common technique is to keep only a few fixed objects in memory and load the rest on demand
  - Analogous to the virtual memory / disk caching mechanisms of operating systems

Storage Management In Natix
[Figure: storage architecture. Labels: Physical Storage, Partitions, Pages, Segments, Records (Slotted Page Segments), Buffer Manager, Page Interpreter, Internal Database Structure]

How To Store XML Documents in a DB?
- Flat stream:
  - Each XML document is a byte stream (e.g. a file or a DBMS BLOB)
  - Fast handling of large sequential chunks or whole documents [document-centric]
  - Poor random access: requires parsing the XML
  - Example: a web server's HTML file tree

How To Store XML Documents in a DB?
- Metamodeling:
  - Separately store each element of the data model of an XML document (e.g. in a DBMS)
    - Analogy to an RDBMS: the entities and relations of an ERD model are stored in separate records in tables
  - Fast random access [data-centric]
  - Slow processing of whole documents, which must be assembled from (possibly) thousands of separate records
  - A mechanism is required for translating between data models, e.g. between XML and relational

How To Store XML Documents in a DB?
- Mix of flat stream (FS) and metamodeling (MM):
  - Redundant: store each document both as a byte stream and as a collection of records
    - Read access is optimal: use whichever representation matches the case
    - Write access has high overhead: every operation updates both types of storage
  - Hybrid: define a "granularity threshold"
    - A "small structure" object is stored as a flat stream inside a single database record
    - A "large structure" object is divided into several records
    - Balances the cost of whole-document operations against per-node operations

Natix XML Storage
- A hybrid approach:
  - Each database record contains a single subtree of an XML document
  - A dynamic granularity threshold adapts to the size and structure of documents at runtime
  - A subtree can grow and split into several records
  - Small subtrees can be merged into a larger subtree in a single record

Types of Stored Nodes in Natix
- Aggregate nodes: inner nodes of the tree, containing their respective child nodes
  - Helper aggregate nodes: "virtual" aggregate nodes used for grouping subsets of the children of an actual aggregate node into subtrees in records
- Literal nodes: leaf nodes, each containing an unparsed stream of bytes
- Proxy nodes: "virtual" nodes that point to subtrees contained in other records
[Figure: example nodes f1, h1, p1]
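The node types can be modeled as small data classes, together with a function that recovers the logical tree by following proxy nodes and splicing out helper aggregates. This is a minimal sketch: the class names, the `records` dictionary keyed by record id, and the `logical_tree` function are hypothetical and do not reflect Natix's actual record layout.

```python
from dataclasses import dataclass, field


@dataclass
class Literal:
    data: bytes                      # unparsed byte stream


@dataclass
class Aggregate:
    name: str
    children: list = field(default_factory=list)
    helper: bool = False             # True for "virtual" helper aggregates


@dataclass
class Proxy:
    record_id: str                   # record that holds the target subtree


def logical_tree(node, records):
    """Return the logical nodes for `node` as a list (helpers splice in several)."""
    if isinstance(node, Proxy):
        return logical_tree(records[node.record_id], records)
    if isinstance(node, Literal):
        return [node]
    kids = []
    for child in node.children:
        kids.extend(logical_tree(child, records))
    if node.helper:
        return kids                  # a helper contributes only its children
    return [Aggregate(node.name, kids)]


# Three records: r1 holds the root, r2 a helper grouping two siblings, r3 a subtree.
records = {
    "r1": Aggregate("f1", [Proxy("r2"), Proxy("r3")]),
    "r2": Aggregate("h1", [Aggregate("f2"), Aggregate("f3")], helper=True),
    "r3": Aggregate("f4", [Literal(b"text")]),
}
tree = logical_tree(records["r1"], records)[0]
```

In the rebuilt tree, `f2`, `f3` and `f4` are all direct children of `f1`: the helper `h1` and both proxies have disappeared, which is why they are called "virtual" nodes.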

Natix Storage Example
[Figure: an XML file and its logical tree, with nodes f1 to f8]

Natix Storage Example
[Figure: the physical tree, with nodes f1 to f8, helper aggregates h1 and h2, proxy nodes p1 to p3, and records r1 to r4]

Modifying Documents In Natix
- The physical tree is treated as a B-tree of records
  - It is kept balanced
- If a record becomes too big when inserting a subtree:
  - Split the record into a separator part, a left part, and a right part
  - Insert the separator into the parent record
  - The algorithm may repeat in the parent
- Similarly, a delete operation can result in a record merge

Modifying Documents In Natix
1. Add node f10 to a record containing the following subtree:
   [Figure: a subtree with nodes f1 to f14, partitioned into separator S, left forest L, and right forest R]
2. f7 is the split node
3. Separator S is the path from the root up to, but not including, f7
4. Right forest R is induced by the subtree of f7 and all the descendants of S located to the right of f7
5. Left forest L is induced by the rest of the nodes

Modifying Documents In Natix
1. In L and R, subtrees with sibling roots are grouped using helper aggregates
2. Each subtree is put into a separate new partition record
3. The separator connects to the partition records through proxy nodes
4. The separator either replaces the proxy in the parent record or forms a new root
[Figure: the result of the split, with helper aggregates h1 and h2, proxy p1, records r1 to r4, and the parent record]
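The partitioning of a record's subtree into separator S, left forest L and right forest R can be sketched as follows. The `Node` class, the function names, and the small example tree are hypothetical (the tree is not the one from the slide's figure); the later steps of grouping with helper aggregates and linking through proxies are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def find_path(node, target):
    """Return the list of nodes from `node` down to the node named `target`."""
    if node.name == target:
        return [node]
    for child in node.children:
        path = find_path(child, target)
        if path:
            return [node] + path
    return []


def split(root, target):
    """Return (separator names, left forest roots, right forest roots)."""
    path = find_path(root, target)
    if len(path) < 2:
        raise ValueError("split node must be a proper descendant of the root")
    separator = [n.name for n in path[:-1]]   # root .. parent of split node
    left, right_levels = [], []
    for parent, on_path in zip(path, path[1:]):
        idx = next(i for i, c in enumerate(parent.children) if c is on_path)
        left.extend(parent.children[:idx])            # subtrees left of the path
        right_levels.append(parent.children[idx + 1:])  # subtrees right of it
    right = [path[-1]]                        # the split node's own subtree
    for level in reversed(right_levels):      # deepest siblings first (document order)
        right.extend(level)
    return separator, left, right


tree = Node("f1", [
    Node("f2"),
    Node("f6", [Node("f5"), Node("f7", [Node("f8"), Node("f9")]), Node("f10")]),
    Node("f11"),
])
separator, left, right = split(tree, "f7")
```

For this tree the separator is `f1, f6`, the left forest has roots `f2` and `f5`, and the right forest has roots `f7` (with its children), `f10` and `f11`.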

Indexes In Native XML-DBs
- Indexes accelerate query evaluation by quickly locating elements / values / text in the DB
- Indexes may be created:
  - as fixed parts of the storage system, or
  - upon user request, or
  - automatically, due to repeated use of certain queries
- Index granularity may vary:
  - Point to each node that contains a specific key, or just to the containing document
  - A trade-off between the size and construction speed of the index and its effectiveness

Indexes In Native XML-DBs
- Common types of indexes:
  - Value indexes: list the locations of each typed value of a node
    - E.g., the locations of the integer value 1492
  - Element indexes: list the locations of elements in documents, preserving hierarchy
    - Locate an element of a specific type (//footnote) or in a specific context (/appendix/footnote)
    - E.g., the Tamino index structure, Natix XASR

Indexes In Native XML-DBs
- Full-text indexes: list the locations of text within the content of elements
  - Common technique: inverted files
  - "Location of a word": offset in the file and/or position in the document hierarchy
  - Become more useful the less structured the document is
- In the future, text index mechanisms are expected to resemble a modern search engine:
  - Handle word equivalence (singular/plural, synonyms)
  - Ranked matching (degree of proximity instead of true/false)
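An inverted file can be sketched in a few lines. This is an illustration under simple assumptions: documents are plain strings keyed by a hypothetical document id, a word is any run of word characters, and the location recorded is the character offset in the document (not a position in the XML hierarchy).

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased word to a list of (doc_id, character offset)."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()].append((doc_id, match.start()))
    return index


docs = {"d1": "The cytoskeleton supports the cell", "d2": "Cell organelle"}
index = build_inverted_index(docs)
```

Looking up `index["cell"]` returns one posting per occurrence across both documents; an index with coarser granularity would store only the document ids.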

Natix eXtended Access Support Relations (XASR)

Example document tree (each node annotated with d_min, d_max):
  bioml (1, 12)
    organism (2, 11)
      organelle (3, 6)
        label (4, 5): "cytoskeleton"
      organelle (7, 10)
        label (8, 9): "mitochondrion"

1. The document is traversed in DFS order. Each node is assigned a value d_min upon entry and d_max upon exit. d_min is also the unique node id.
2. An XASR table is constructed as follows:

   d_min  d_max  eType      docID  parent
   1      12     bioml      0      NULL
   2      11     organism   0      1
   3      6      organelle  0      2
   4      5      label      0      3
   7      10     organelle  0      2
   8      9      label      0      7

3. During query evaluation, path connectors ('/' or '//') are resolved through join operations on the XASR table:
   - The join predicate for '/' is: x_i.docID = x_{i+1}.docID and x_i.d_min = x_{i+1}.parent
   - The join predicate for '//' is: x_i.docID = x_{i+1}.docID and x_i.d_min < x_{i+1}.d_min and x_i.d_max > x_{i+1}.d_max
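The DFS numbering and the two join predicates can be tried out on the bioml example. The sketch below is an assumption-laden illustration, not Natix code: rows are Python dicts, the joins are naive nested loops, the function names are made up, and the '//' predicate is taken to be interval containment (the ancestor's d_min is smaller and its d_max larger than the descendant's).

```python
import xml.etree.ElementTree as ET


def build_xasr(root, doc_id=0):
    """Number the elements in DFS order and return the XASR rows as dicts."""
    rows, counter = [], [0]

    def visit(elem, parent):
        counter[0] += 1                       # entry: assign d_min
        row = {"d_min": counter[0], "d_max": None, "eType": elem.tag,
               "docID": doc_id, "parent": parent}
        rows.append(row)
        for child in elem:
            visit(child, row["d_min"])
        counter[0] += 1                       # exit: assign d_max
        row["d_max"] = counter[0]

    visit(root, None)
    return rows


def child_join(left, right):
    """Pairs (a, b) where b is a child of a: the '/' predicate."""
    return [(a, b) for a in left for b in right
            if a["docID"] == b["docID"] and a["d_min"] == b["parent"]]


def descendant_join(left, right):
    """Pairs (a, b) where b is a descendant of a: the '//' predicate."""
    return [(a, b) for a in left for b in right
            if a["docID"] == b["docID"]
            and a["d_min"] < b["d_min"] and a["d_max"] > b["d_max"]]


def by_type(rows, etype):
    return [r for r in rows if r["eType"] == etype]


doc = ET.fromstring(
    "<bioml><organism>"
    "<organelle><label>cytoskeleton</label></organelle>"
    "<organelle><label>mitochondrion</label></organelle>"
    "</organism></bioml>")
xasr = build_xasr(doc)
```

Evaluating `//label` under `bioml` becomes `descendant_join(by_type(xasr, "bioml"), by_type(xasr, "label"))`, and `organelle/label` becomes a `child_join`; neither needs to walk the document tree.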

Transaction Management
- An XML DBMS provides support for transactions
  - A transaction is a sequence of operations on XML items that is either committed or rolled back as a whole
- Transaction execution follows the ACID properties:
  - Atomicity: each transaction either completes or has no effect at all
  - Consistency: each transaction transforms the DB from one consistent state to another
  - Isolation (serializability): concurrently executing transactions behave as if they were executing in some sequential order
  - Durability: once committed, a transaction's effect on the DB is permanent

Transaction Management
- Consistency is achieved by properly defining the transaction boundaries
  - Example: when moving money between bank accounts, update both accounts in one transaction
  - This is the responsibility of the application programmer
- The other properties are provided by the DBMS
  - Atomicity and durability are provided through logging and recovery
  - Isolation is maintained through locking

Logging And Recovery
- Logging and recovery provide two important functions: undo and redo
  - Undo of transactions that are aborted (will not complete): enables atomicity
  - Redo of committed transactions, in case the DBMS fails before the transaction results were completely written to disk: enables durability

Logging And Recovery
- The XML DBMS keeps a log of all operations affecting the DB
- Each transaction operation that writes to a DB item generates a log record
  - Write-ahead logging: log first, then write
- All log records of the same transaction are linked chronologically, forward (for redo) and backward (for undo)
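Write-ahead logging with a per-transaction backward chain can be sketched over an in-memory dictionary. Everything here is a simplification chosen for illustration: the `Log` class, the record fields, and the convention that an old value of `None` means "the key did not exist" are assumptions, and nothing is persisted to disk; only the backward (undo) links are modeled.

```python
class Log:
    """Append-only log; each record points to the previous record of its transaction."""

    def __init__(self):
        self.records = []            # the index in this list serves as the LSN
        self.last_lsn = {}           # transaction id -> LSN of its latest record

    def append(self, txn, key, old, new):
        lsn = len(self.records)
        self.records.append({"txn": txn, "key": key, "old": old, "new": new,
                             "prev": self.last_lsn.get(txn)})
        self.last_lsn[txn] = lsn
        return lsn


def write(db, log, txn, key, value):
    log.append(txn, key, db.get(key), value)   # log first ...
    db[key] = value                            # ... then write


def undo(db, log, txn):
    """Roll back `txn` by following its backward chain, newest record first."""
    lsn = log.last_lsn.get(txn)
    while lsn is not None:
        rec = log.records[lsn]
        if rec["old"] is None:
            db.pop(rec["key"], None)           # assumed: None means "was absent"
        else:
            db[rec["key"]] = rec["old"]
        lsn = rec["prev"]


db, log = {"a": 1}, Log()
write(db, log, "T1", "a", 2)
write(db, log, "T2", "b", 5)
write(db, log, "T1", "c", 3)
write(db, log, "T1", "a", 4)
undo(db, log, "T1")
```

After `undo`, the effects of T1 are gone while T2's write to `b` is untouched, because the chain visits only T1's records.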

Logging And Recovery In Natix
- Optimizing logging and recovery for XML hierarchies:
  - Subsidiary logging: log records are cached and merged into more compact records before entering the log
    - One log record for adding a subtree instead of many single-node-added log records
  - Annihilator undo: there is no need to perform undo operations that are covered by later undo operations
    - Skip remove-node / modify-node operations that are followed by a matching remove-subtree operation

Locking Mechanism
- To ensure that transactions are isolated from each other, each transaction locks its DB resources before operating on them
  - Locking a resource prevents other transactions from modifying or accessing it until it is unlocked
  - Most common locking protocol: S2PL (Strict Two-Phase Locking)
    - Every resource is locked before its first access
    - All resources are unlocked together, atomically, at abort/commit

Types Of Locks
- Classic locks:
  - S: Shared lock, for reading an item
    - Multiple transactions can share read access to an item
    - Prevents item modification by other transactions
  - X: eXclusive lock, for modifying an item
    - At most one transaction can hold an X lock on an item
    - Prevents other transactions from accessing the item
- Compatibility matrix (rows: held lock; columns: requested lock):

          S   X
      S   +   -
      X   -   -

Types Of Locks
- Problem: while a transaction T operates on an element in an XML document:
  - Other transactions should be allowed to operate on other elements of the document, to increase concurrency
  - However, no other transaction should be allowed to modify nodes that contain the element T is working on, to maintain the tree structure

Types Of Locks
- Solution: warning locks
  - Lock all the nodes on the path from the root to the designated node against modification by other transactions, according to the intended operation on the designated node
  - IS: Intention Share, intend to read sub-elements of this node
  - IX: Intention eXclusive, intend to modify sub-elements of this node

Types Of Locks
  - SIX: Share Intention eXclusive, intend to read this node and modify sub-elements of it
- Compatibility matrix (rows: held lock; columns: requested lock):

            S   X   IS  IX  SIX
      S     +   -   +   -   -
      X     -   -   -   -   -
      IS    +   -   +   +   +
      IX    -   -   +   +   -
      SIX   -   -   +   -   -
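The compatibility matrix and the root-to-node warning-lock protocol can be combined in a small lock table. This is a single-threaded sketch under stated assumptions: the `LockTable` class and its methods are hypothetical, a conflicting request simply returns `False` instead of waiting, locks already taken by a failed request are kept, and lock upgrades within one transaction are not handled. Releasing everything at once in `release_all` mirrors strict two-phase locking.

```python
# For each held mode, the set of modes another transaction may still acquire.
COMPATIBLE = {
    "S":   {"S", "IS"},
    "X":   set(),
    "IS":  {"S", "IS", "IX", "SIX"},
    "IX":  {"IS", "IX"},
    "SIX": {"IS"},
}


class LockTable:
    def __init__(self):
        self.held = {}               # node -> list of (txn, mode)

    def try_lock(self, txn, node, mode):
        """Grant `mode` on `node` unless another transaction holds a conflicting lock."""
        for other, other_mode in self.held.get(node, []):
            if other != txn and mode not in COMPATIBLE[other_mode]:
                return False
        self.held.setdefault(node, []).append((txn, mode))
        return True

    def lock_path(self, txn, path, mode):
        """Warning-lock the ancestors (IS or IX), then lock the last node in S or X."""
        intention = "IS" if mode == "S" else "IX"
        for node in path[:-1]:
            if not self.try_lock(txn, node, intention):
                return False
        return self.try_lock(txn, path[-1], mode)

    def release_all(self, txn):
        """Drop every lock of `txn` at once, as at commit or abort."""
        for node in list(self.held):
            self.held[node] = [(t, m) for t, m in self.held[node] if t != txn]


table = LockTable()
t1_ok = table.lock_path("T1", ["bib", "conference[1]", "title"], "X")
t2_ok = table.lock_path("T2", ["bib", "conference[2]", "year"], "X")
t3_root = table.lock_path("T3", ["bib"], "S")
t3_before = table.lock_path("T3", ["bib", "conference[1]"], "S")
table.release_all("T1")
t3_after = table.lock_path("T3", ["bib", "conference[1]"], "S")
```

T1 and T2 modify different subtrees concurrently because their IX locks on `bib` are compatible, while T3's attempt to read the whole document, or a subtree T1 is modifying, is refused until T1 releases its locks.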

Types Of Locks - Natix
- What if a transaction accesses a node directly, without traversing down from the root?
  - E.g., when dereferencing an IDREF value
  - It first needs to find the root to begin locking
  - It must make sure the path to the node is not modified until locking completes
- In Natix, nodes are first locked in a special Shared Parent Pointer (SPP) mode on the way up, then with regular locks on the way down
  - SPP is a minimal lock, incompatible only with X

Lock Control In Natix
- Lock escalation: if a transaction holds too many locks, its locks are grouped and converted to bigger locks over entire documents (or even segments), if possible
  - If the transaction still holds too many locks, it is aborted
- Deadlock detection: if a transaction waits too long for a lock, the system starts looking for deadlocks involving the waiting transaction
  - If one is found, the transaction is aborted
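One common way to look for deadlocks involving a particular waiting transaction is to search a wait-for graph for a cycle through it. The slides do not say how Natix does this, so the graph representation and the `find_deadlock` function below are assumptions made purely for illustration.

```python
def find_deadlock(waits_for, start):
    """Return a cycle of transactions through `start`, or None.

    `waits_for` maps a transaction to the transactions that hold locks
    it is waiting for.
    """
    stack = [(start, [start])]
    seen = set()
    while stack:
        txn, path = stack.pop()
        for holder in waits_for.get(txn, ()):
            if holder == start:
                return path              # the chain of waits leads back to start
            if holder not in seen:
                seen.add(holder)
                stack.append((holder, path + [holder]))
    return None


waits_for = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": ["T1"]}
cycle = find_deadlock(waits_for, "T1")
no_cycle = find_deadlock(waits_for, "T4")
```

Here T1, T2 and T3 wait on each other in a cycle, so a search started for T1 reports a deadlock; T4 is merely blocked behind that cycle and is not part of it, so a search started for T4 finds nothing.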
