TREECHOP: A Tree-based Query-able Compressor for XML. Gregory Leighton, Tomasz Müldner, James Diamond. Acadia University, June 6, 2005.

Outline
- XML
- TREECHOP
  - Compression Strategy
  - Decompression Strategy
  - Querying Strategy
- Experimental Results
- Conclusions

Extensible Markup Language (XML)
- What is it?
  - A standard for semi-structured data representation introduced in 1998
  - Data is surrounded by markup tokens (elements and attributes) used to indicate semantic meaning
- Characteristics?
  - Verbose (often 5 – 10 times larger than alternative formats like CSV)
  - Lots of repetition, so plenty of opportunities for data compression

Example XML Document
[Figure: sample XML document with callouts marking the root element, an attribute, a data value, and a comment]

TREECHOP: Compression Strategy
- Parsing splits document into three segments:
  - Prologue: stores text occurring before document’s root element
  - Document Tree: contains all document contents between and including root element start and end tags
  - Epilogue: stores text occurring after document’s root element
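
The three-way split can be illustrated with a minimal Python sketch. It is not TREECHOP's parser: the function name `split_document` and the sample document are made up for illustration. It also assumes the root element name does not appear inside comments in the prologue or epilogue, whereas a real implementation would rely on an XML parser.

```python
import re

def split_document(xml_text):
    """Split XML text into (prologue, document_tree, epilogue)."""
    # The first '<' followed by a name-start character opens the root element;
    # '<?' (declarations, PIs) and '<!' (comments, DOCTYPE) are skipped.
    start_match = re.search(r"<[A-Za-z_:]", xml_text)
    if start_match is None:
        raise ValueError("no root element found")
    start = start_match.start()
    root_name = re.match(r"<([^\s/>]+)", xml_text[start:]).group(1)
    end_tag_pos = xml_text.rfind("</" + root_name)
    if end_tag_pos == -1:
        # Self-closing root element such as <root/>
        end = xml_text.index(">", start) + 1
    else:
        end = xml_text.index(">", end_tag_pos) + 1
    return xml_text[:start], xml_text[start:end], xml_text[end:]

# Hypothetical sample document
doc = (
    '<?xml version="1.0"?>\n<!-- order -->\n'
    '<PurchaseOrder no="1"><Date>2005-06-06</Date></PurchaseOrder>\n'
    '<!-- end -->\n'
)
prologue, tree, epilogue = split_document(doc)
```

Concatenating the three segments gives back the original text, so nothing outside the root element is lost.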

Example XML Document
[Figure: the sample XML document divided into Prologue, Document Tree, and Epilogue]

Document Tree
- Root node corresponds to document’s root element
- Character data segments are represented using leaf nodes
- XML markup represented using non-leaf nodes; 5 types of non-leaf nodes:
  - Element, attribute, CDATA, comment, processing instruction

Document Tree Generation
1. Get next token from XML parser
2. Construct tree node from token
3. Write tree node to compression stream

Document Tree Nodes
- Each node in the tree has an associated label value, L
  - Element node: name of the element
  - Attribute node: ‘@’ + name of the attribute
  - Comment, CDATA, processing instruction nodes: all text between delimiting section markers
- The path for a node v_n is /L_1/L_2/…/L_n, where the route connecting the root node v_1 with v_n consists of nodes v_1, v_2, …, v_n and L_i is the label for node v_i
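
The labelling and path rules can be sketched as a short streaming pass in Python. This is an illustration only: it uses the standard library's `xml.etree.ElementTree.iterparse` rather than TREECHOP's own parser, it handles only element and attribute nodes, and both the function name `node_paths` and the sample document are assumptions built from the path names used in this presentation.

```python
import io
import xml.etree.ElementTree as ET

def node_paths(xml_text):
    """Yield the path of every element and attribute node in document order."""
    stack = []  # labels L_1 ... L_n of the nodes on the current route
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)              # element label: its name
            element_path = "/" + "/".join(stack)
            yield element_path
            for attr_name in elem.attrib:       # attribute label: '@' + name
                yield element_path + "/@" + attr_name
        else:
            stack.pop()

# Hypothetical sample document
doc = (
    '<PurchaseOrder no="1"><Date>2005-06-06</Date>'
    "<Order><Item><ProductNo>7</ProductNo><Quantity>2</Quantity></Item></Order>"
    "</PurchaseOrder>"
)
paths = list(node_paths(doc))
```

Because each path is produced as soon as the start tag is seen, the whole tree never needs to be held in memory.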

Codeword Generation
- A binary codeword is assigned to each non-leaf node, based on node path
  - Multiple nodes with identical path are assigned same codeword
- Codeword is used during decompression and querying operations to identify the value and type of each node

Codeword Generation
- The codeword C(v) assigned to a non-leaf node v with parent node p is formed by the concatenation of three codes
  - C(p): the codeword assigned to p
  - G(v): Golomb code assigned to v based on its ordering relative to p
  - T(v): a sequence of 3 bits used to indicate node type

Example XML Document

Example Document Tree

Codeword Assignment

Node Path                               C(v)
/PurchaseOrder                          00000
/PurchaseOrder/@no                      0000000001
/PurchaseOrder/Date                     00000010000
/PurchaseOrder/CustomerID               00000011000
/PurchaseOrder/Order                    00000100000
/PurchaseOrder/Order/Item               0000010000000000
/PurchaseOrder/Order/Item/ProductNo     000001000000000000000
/PurchaseOrder/Order/Item/Quantity      0000010000000000010000

Each codeword consists of three portions:
- C(p): portion inherited from parent node
- G(v): portion assigned based on Golomb code
- T(v): portion used to indicate node type
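
The codewords listed for the example paths can be reproduced with a short Python sketch. Several details are inferred from those codewords rather than stated on the slides, so treat them as assumptions: a Golomb parameter of m = 3, a zero-based index shared by all children of a parent in order of first appearance, and type bits 000 for elements and 001 for attributes. The names `golomb` and `assign_codewords` are illustrative.

```python
ELEMENT, ATTRIBUTE = "000", "001"  # type bits inferred from the example codewords

def golomb(n, m=3):
    """Golomb code of n >= 0: unary quotient, then truncated-binary remainder."""
    if m < 2:
        raise ValueError("this sketch expects m >= 2")
    q, r = divmod(n, m)
    unary = "1" * q + "0"
    b = (m - 1).bit_length()        # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m           # remainders below cutoff use b - 1 bits
    if r < cutoff:
        remainder = format(r, "b").zfill(b - 1)
    else:
        remainder = format(r + cutoff, "b").zfill(b)
    return unary + remainder

def assign_codewords(paths):
    """Map each distinct path to C(v) = C(p) + G(v) + T(v)."""
    codewords = {}
    child_count = {}                # parent path -> number of distinct children seen
    for path in paths:
        if path in codewords:       # identical paths share one codeword
            continue
        parent, _, label = path.rpartition("/")
        index = child_count.get(parent, 0)
        child_count[parent] = index + 1
        type_bits = ATTRIBUTE if label.startswith("@") else ELEMENT
        codewords[path] = codewords.get(parent, "") + golomb(index) + type_bits
    return codewords

codes = assign_codewords([
    "/PurchaseOrder",
    "/PurchaseOrder/@no",
    "/PurchaseOrder/Date",
    "/PurchaseOrder/CustomerID",
    "/PurchaseOrder/Order",
    "/PurchaseOrder/Order/Item",
    "/PurchaseOrder/Order/Item/ProductNo",
    "/PurchaseOrder/Order/Item/Quantity",
])
```

Under these assumptions `/PurchaseOrder/Date` is the second child of the root, so it receives `00000` (parent) + `010` (Golomb code of 1) + `000` (element).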

TREECHOP: Writing the Tree
- Encoded tree is written to compression stream in depth-first order; gzip is applied to further compress the encoded tree
- Non-leaf nodes: written as 3-tuple (L, C, D)
  - L is a byte indicating bit length of codeword
  - C is a sequence of ⌈L / 8⌉ bytes containing codeword
  - D is the node’s label (e.g. element/attribute name); reserved byte values are used to signal beginning/end of sequence of raw character data

TREECHOP: Writing the Tree
- On second and subsequent occurrences of a particular codeword, only the 2-tuple (L, C) is written (decoder is able to infer associated D)
- Leaf nodes are transmitted in same manner as D value for non-leaf nodes
- Each node encoding is transmitted immediately after node construction, which avoids the necessity of building entire tree in memory
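
A minimal Python sketch of the (L, C, D) layout follows. The slides do not give the reserved byte values, the bit padding, or the text encoding, so `RAW_START`, `RAW_END`, right-padding with zero bits, and UTF-8 are all assumptions, as is the class name `TreeWriter`. For brevity the sketch buffers the output and compresses it once at the end; a streaming gzip writer would match the node-by-node transmission described above more closely.

```python
import gzip

RAW_START, RAW_END = 0x02, 0x03  # hypothetical reserved byte values

def pack_bits(bits):
    """Pack a '0'/'1' string into ceil(len / 8) bytes, zero-padded on the right."""
    n_bytes = -(-len(bits) // 8)
    return int(bits.ljust(n_bytes * 8, "0"), 2).to_bytes(n_bytes, "big")

def raw(text):
    """Delimit raw character data with the reserved byte values."""
    return bytes([RAW_START]) + text.encode("utf-8") + bytes([RAW_END])

class TreeWriter:
    def __init__(self):
        self.seen = set()        # codewords whose label was already written
        self.out = bytearray()

    def write_non_leaf(self, codeword, label):
        self.out.append(len(codeword))    # L: bit length (must fit in one byte)
        self.out += pack_bits(codeword)   # C: ceil(L / 8) bytes
        if codeword not in self.seen:     # D: only on first occurrence
            self.seen.add(codeword)
            self.out += raw(label)

    def write_leaf(self, text):
        self.out += raw(text)             # same form as a D value

    def finish(self):
        return gzip.compress(bytes(self.out))

writer = TreeWriter()
writer.write_non_leaf("00000010000", "Date")   # first occurrence: (L, C, D)
first = bytes(writer.out)
writer.write_leaf("2005-06-06")
writer.write_non_leaf("00000010000", "Date")   # repeat: only (L, C)
compressed = writer.finish()
```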

TREECHOP: Decompression Strategy
- Decoder operates by reading node data from compression stream. For each non-leaf node:
  1. Determine D value
  2. Determine node type
  3. Surround D with XML syntax appropriate to the node type and immediately emit to the decompression stream
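
Steps 2 and 3 can be sketched in Python as below. The type bits for elements (000) and attributes (001) follow the example codewords; the values for CDATA, comment, and processing instruction nodes are placeholders, since the slides do not list them. The function names are illustrative. Element content is assumed to be already escaped, and merging attributes into an element's start tag is left out.

```python
from xml.sax.saxutils import quoteattr

TYPE_NAMES = {
    "000": "element",    # as in the example codewords
    "001": "attribute",  # as in the example codewords
    "010": "cdata",      # hypothetical value
    "011": "comment",    # hypothetical value
    "100": "pi",         # hypothetical value
}

def node_type(codeword):
    """The last 3 bits of a codeword, T(v), identify the node type."""
    return TYPE_NAMES[codeword[-3:]]

def wrap(codeword, label, content=""):
    """Surround a node's label (and content, if any) with XML syntax."""
    kind = node_type(codeword)
    if kind == "element":
        return f"<{label}>{content}</{label}>"
    if kind == "attribute":
        return f" {label[1:]}={quoteattr(content)}"  # drop the leading '@'
    if kind == "cdata":
        return f"<![CDATA[{label}]]>"
    if kind == "comment":
        return f"<!--{label}-->"
    return f"<?{label}?>"

date_markup = wrap("00000010000", "Date", "2005-06-06")
attr_markup = wrap("0000000001", "@no", "1")
```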

TREECHOP: Querying Strategy
- An individual query handler is registered with the decoder for each query
- Single scan of compression stream is carried out, using a stack to keep track of current path
- When query predicate path is matched, the current codeword is recorded and remainder of compression stream is scanned for future occurrences
- Each time a query match is encountered, the associated D value is extracted from the compression stream and passed to the query handler for processing
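
The single-scan idea can be sketched in Python over an already decoded node sequence. This is an assumption-laden illustration rather than the actual decoder: nodes are modelled as `(codeword, label or None, text or None)` tuples, the parent of a node is found by popping the stack until its top is a proper prefix of the new codeword, and the value passed to the handler is taken to be the node's character data. The name `run_queries` and the sample data are hypothetical.

```python
def run_queries(nodes, handlers):
    """Scan nodes once, calling handlers[path](text) for each matching node.

    nodes: (codeword, label_or_None, text_or_None) tuples in depth-first order;
           the label is present only on a codeword's first occurrence.
    handlers: dict mapping a query path to a callback.
    """
    labels = {}    # codeword -> label, learned on first occurrence
    matched = {}   # codeword -> handler, recorded once its path matches
    stack = []     # codewords of the nodes on the current path
    for codeword, label, text in nodes:
        if label is not None:
            labels[codeword] = label
        # A child's codeword extends its parent's, so pop until the top of
        # the stack is a proper prefix of the new codeword.
        while stack and not (codeword != stack[-1] and codeword.startswith(stack[-1])):
            stack.pop()
        stack.append(codeword)
        if codeword not in matched:
            path = "/" + "/".join(labels[c] for c in stack)
            if path in handlers:
                matched[codeword] = handlers[path]
        if codeword in matched and text is not None:
            matched[codeword](text)

# Hypothetical decoded stream with two Item elements
nodes = [
    ("00000", "PurchaseOrder", None),
    ("0000000001", "@no", "1"),
    ("00000010000", "Date", "2005-06-06"),
    ("00000100000", "Order", None),
    ("0000010000000000", "Item", None),
    ("000001000000000000000", "ProductNo", "7"),
    ("0000010000000000010000", "Quantity", "2"),
    ("0000010000000000", None, None),
    ("000001000000000000000", None, "9"),
    ("0000010000000000010000", None, "4"),
]
quantities = []
run_queries(nodes, {"/PurchaseOrder/Order/Item/Quantity": quantities.append})
```

After the first match, later occurrences are recognised by codeword alone, without rebuilding the path string.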

Experimental Results: Compression Rates

File             Size (KB)   Elements   Attributes      Data
(A) Baseball           788      27080            0    230970
(B) Macbeth            175       3975            0     97625
(C) 150emp              26        901          150      8277
(D) 100000emp        16831     600001       100000   5534311

Experimental Results: Compression/Decompression Speed
[Chart not preserved] Distance between sender/receiver: 20 km / 12 miles

Experimental Results: Querying
[Chart not preserved] Distance between sender/receiver: 20 km / 12 miles

Conclusions
- TREECHOP compresses at rates comparable to gzip, while also providing query-friendly annotations to the compression stream
- Using TREECHOP querying in place of alternative methods like XSLT yields a significant performance advantage on medium- to large-sized XML documents; advantage increases with document size
