Lecture 9: XML Compression


Semistructured Data / XML
Loosely structured: no restrictions on tags and nesting relationships
No schema required
XML falls under the "semistructured" umbrella
Self-describing
The standard for information representation and exchange

An XML data file can be modeled in a tree form
XML:
  <Staff>
    <Name>
      <FirstName> Raymond </FirstName>
      <LastName> Wong </LastName>
    </Name>
    <Login> wong </Login>
    <Ext> 5932 </Ext>
  </Staff>
Tree:
  Staff
    Name
      FirstName: "Raymond"
      LastName: "Wong"
    Login: "wong"
    Ext: "5932"
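The tree view of the Staff record can be reproduced with a few lines of Python. This is a minimal sketch using the standard library's ElementTree parser; the helper name tree_lines is made up for illustration and is not part of the lecture.

```python
import xml.etree.ElementTree as ET

STAFF_XML = """
<Staff>
  <Name>
    <FirstName> Raymond </FirstName>
    <LastName> Wong </LastName>
  </Name>
  <Login> wong </Login>
  <Ext> 5932 </Ext>
</Staff>
"""

def tree_lines(elem, depth=0):
    """Return one indented line per element; leaf text is shown after the tag."""
    text = (elem.text or "").strip()
    line = "  " * depth + elem.tag + (f' "{text}"' if text else "")
    lines = [line]
    for child in elem:
        lines.extend(tree_lines(child, depth + 1))
    return lines

root = ET.fromstring(STAFF_XML)
outline = tree_lines(root)
print("\n".join(outline))
```

Each element becomes one tree node, and the text inside a leaf element becomes that node's value.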

XPath evaluation
Document: <a><c>12</c><d>7</d><c>7</c></a>
Query: /a/b[c = "12"]
[Figure: document tree with node labels a; b, b; c, d, c; values 12, 7, 7]

Query evaluation
Top-down
Bottom-up
Hybrid


XPath evaluation
Document: <a><c>12</c><d>7</d><c>7</c></a>
Query: /a/b[c = "12"]
Result: <c>12</c><d>7</d>
[Figure: document tree with node labels a; b, b; c, d, c; values 12, 7, 7]
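The query can be tried with ElementTree's limited XPath support. This is a sketch under one loud assumption: the document string in the transcript has no b elements, while the tree and the query do, so the code below uses an assumed document in which the first b holds c=12 and d=7 and the second b holds c=7.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: <b> wrappers are added here to match the tree and the query;
# they do not appear in the transcript's document string.
DOC = "<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>"

root = ET.fromstring(DOC)                 # root is the <a> element
# Relative to <a>, "b[c='12']" plays the role of /a/b[c = "12"]:
# select b children that have a c child whose text is exactly "12".
matches = root.findall("b[c='12']")

# Serialize the content of each matching b element.
result = [
    "".join(ET.tostring(child, encoding="unicode") for child in b)
    for b in matches
]
print(result)
```

ElementTree evaluates this top-down from the root; the other evaluation strategies named in the lecture are not shown here.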

Path indexing
Traversing the graph/tree is almost the same as query processing for semistructured / XML data.
Normally, evaluation requires traversing the data from the root and returning all nodes X reachable by a path matching the given regular path expression.
Motivation: an index allows the system to answer regular path expressions without traversing the whole graph/tree.

Major Criteria for indexing
Speed up the search (by cutting the search space down)
Relatively smaller size than the original data graph/tree
Easy to maintain (during data loading and during updates)

An Example of DAG Data
[Figure: data graph with nodes root and o1 to o13; edge labels member, dept, support, staff, name, phone]

Index graph based on language-equivalence
A reduced graph that summarizes all paths from the root in the data graph.
The paths from root to o12:
  staff
  dept/member
  support/member

Language-equivalent nodes
Let L(x) := {w | ∃ a path from the root to x labeled w}
The set L(x) may be infinite when there are cycles
Nodes x, y are language-equivalent (x ≡ y) if L(x) = L(y)
We construct the index I by taking its nodes to be the equivalence classes of ≡
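The definition can be made concrete on a tiny acyclic graph. The sketch below enumerates L(x) by brute force and groups nodes with equal path sets; the edge list and node names are hypothetical, and the enumeration only terminates on graphs without cycles.

```python
from collections import defaultdict

# Hypothetical data graph: (source, edge label, target).
EDGES = [
    ("root", "staff", "s1"), ("root", "staff", "s2"),
    ("root", "dept", "d1"),
    ("d1", "member", "s1"), ("d1", "member", "s2"),
    ("s1", "name", "n1"), ("s2", "name", "n2"),
]

def path_languages(edges, root="root"):
    """Map each node x to L(x), the set of label paths from the root to x."""
    out = defaultdict(list)
    for src, label, dst in edges:
        out[src].append((label, dst))
    langs = defaultdict(set)
    langs[root].add(())                  # the empty path reaches the root
    stack = [(root, ())]
    while stack:                         # enumerates every root path (acyclic only)
        node, path = stack.pop()
        for label, dst in out[node]:
            new_path = path + (label,)
            langs[dst].add(new_path)
            stack.append((dst, new_path))
    return langs

def equivalence_classes(edges, root="root"):
    """Group nodes whose path sets L(x) are identical."""
    groups = defaultdict(set)
    for node, lang in path_languages(edges, root).items():
        groups[frozenset(lang)].add(node)
    return sorted(sorted(group) for group in groups.values())

classes = equivalence_classes(EDGES)
print(classes)
```

Here s1 and s2 are both reached by staff and dept/member, so they fall into one class, which would become a single index node.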

Language-equivalent
The paths from root to o3:
  staff
  dept/member
Paths to o4 happen to be exactly the same 2 sequences
Same for o8 and o12
o3 ≡ o4 ≡ o8 ≡ o12

Equivalence classes
o3 ≡ o4 ≡ o8 ≡ o12
o1 ≡ o2 ≡ o7
o12 ≡ o13
o5 ≡ o6 ≡ o9
o10
o11
[Figure: the data graph (root, o1 to o13; edge labels member, dept, support, staff, name, phone) with the classes marked]

The index graph
[Figure: index graph with nodes root, {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}; edge labels member, support, staff, dept, name, phone]

Query processing based on the index graph
[Figure: index graph with nodes root, {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}, o10, o11; edge labels member, support, staff, dept, name, phone]
dept/member/(name | phone)
-> dept/member/name UNION dept/member/phone
-> {o5, o6, o9} UNION {o10}
-> {o5, o6, o9, o10}
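The evaluation steps above can be sketched as a walk over the index graph followed by a union of extents. The index graph below is a hypothetical simplification loosely modeled on the slide (the node names D, M, N, P and the exact edges are assumptions), and the alternation is expanded by hand into two simple paths.

```python
# Hypothetical index graph: index node -> {edge label: [target index nodes]}.
INDEX_EDGES = {
    "root": {"dept": ["D"], "staff": ["M"]},
    "D": {"member": ["M"]},
    "M": {"name": ["N"], "phone": ["P"]},
}
# Extent of each index node: the data nodes it stands for (assumed values).
EXTENT = {
    "root": {"root"},
    "D": {"o1", "o2", "o7"},
    "M": {"o3", "o4", "o8"},
    "N": {"o5", "o6", "o9"},
    "P": {"o10"},
}

def eval_path(path, start="root"):
    """Follow a '/'-separated label path on the index graph and return
    the union of the extents of the index nodes reached."""
    current = {start}
    for label in path.split("/"):
        current = {
            nxt
            for node in current
            for nxt in INDEX_EDGES.get(node, {}).get(label, [])
        }
    return set().union(*(EXTENT[node] for node in current))

# dept/member/(name | phone) expands into a union of two simple paths.
answer = eval_path("dept/member/name") | eval_path("dept/member/phone")
print(sorted(answer))
```

Only the index graph is traversed; the data graph itself is never touched during evaluation.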

About this indexing scheme
The index graph is never larger than the data
In practice, the index graph is small enough to fit in memory
Constructing the index is, however, a problem: checking whether two nodes are language-equivalent is very expensive (PSPACE)
An approximation based on bisimulation exists

A Data Guide
[Figure: DataGuide rooted at root with edge labels dept, support, staff, member, name, phone; node extents include o11, {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}, o10, and some extents appear more than once]

About Data Guide
Unique labels at each node; (hence) extents are no longer disjoint
Query processing proceeds as before
The size of the index may be >= the data size
Good for data that is regular and has no cycles
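One common way to obtain a structure with unique labels at each node is to determinize the data graph, in the style of the NFA-to-DFA subset construction. The sketch below assumes that construction and uses a hypothetical edge list; it is meant to show why extents can overlap, not to reproduce the lecture's exact DataGuide.

```python
from collections import defaultdict

# Hypothetical data graph: (source, edge label, target).
EDGES = [
    ("root", "staff", "o3"), ("root", "dept", "o1"),
    ("o1", "member", "o3"), ("o1", "member", "o4"),
    ("o3", "name", "o5"), ("o4", "name", "o6"),
]

def build_dataguide(edges, root="root"):
    """Determinize the data graph: every guide node is the set of data nodes
    (its extent) reached by one label path, with one outgoing edge per label."""
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)
    guide = {}                                  # extent -> {label: target extent}
    todo = [frozenset([root])]
    while todo:
        extent = todo.pop()
        if extent in guide:
            continue
        targets = defaultdict(set)
        for node in extent:
            for label, dsts in out[node].items():
                targets[label] |= dsts
        guide[extent] = {label: frozenset(d) for label, d in targets.items()}
        todo.extend(guide[extent].values())
    return guide

guide = build_dataguide(EDGES)
for extent, edges_out in guide.items():
    print(sorted(extent), {label: sorted(t) for label, t in edges_out.items()})
```

In this example o3 appears both in the extent reached by staff and in the extent reached by dept/member, so the extents are not disjoint.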

XML-Specific Compressors
Unqueriable compression (e.g. XMill):
  Full-chunked: data commonalities eliminated
  Very good compression ratio
Queriable compression (e.g. XGrind, XPRESS):
  Fine-grained: data commonalities ignored
  Inadequate compression ratio and time
  Support simple path queries with atomic predicates

XMill
First specialized compressor for XML data
SAX parser for parsing XML data
Still uses gzip as its underlying compressor
Clever grouping of data into containers for compression
Compresses XML via three basic techniques:
  Compress the structure separately from the data
  Group the data values according to their types
  Apply semantic (specialized) compressors
Downloadable:

XMill Architecture:

An Example: Web Server Logs
ASCII file, 15.9 MB (gzipped 1.6MB):
  |GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|
XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1MB):
  <apache:entry>
    <apache:host> </apache:host>
    <apache:requestLine> GET / HTTP/1.0 </apache:requestLine>
    <apache:contentType> text/html </apache:contentType>
    <apache:statusCode> 200</apache:statusCode>
    <apache:date> 1997/10/01-00:00:02</apache:date>
    <apache:byteCount> 4478</apache:byteCount>
    <apache:referer> </apache:referer>
    <apache:userAgent> Mozilla/3.1[ja](I)</apache:userAgent>
  </apache:entry>

How XMill Works: Three Ideas
Compress the structure separately from the data:
  gzip(Structure) + gzip(Data) = 1.75MB
  Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
  Data: GET / HTTP/1.0 text/html 200 …

Group the data values according to their types:
  gzip(Structure) + gzip(Data1) + gzip(Data2) + ... = 1.33MB
  Structure: <apache:entry> . . . </apache:entry>
  Data containers (figure): … ; GET / HTTP/1.0, GET / HTTP/1.1, …
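The first two ideas can be sketched in a few lines: tags go into one stream, text values are grouped into one container per enclosing tag, and each stream is compressed on its own. This is not XMill's actual container format; the token scheme, the function names and the sample log are assumptions, and zlib (the DEFLATE algorithm behind gzip) stands in for gzip.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def split_containers(xml_text):
    """Return (structure tokens, {tag: [values]}): tags go to one stream,
    text values are grouped by the tag that encloses them."""
    structure, containers = [], defaultdict(list)

    def visit(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            structure.append("#")          # placeholder: value lives in a container
            containers[elem.tag].append(text)
        for child in elem:
            visit(child)
        structure.append(">")

    visit(ET.fromstring(xml_text))
    return structure, dict(containers)

def compressed_size(structure, containers):
    """Total bytes after compressing the structure and each container separately."""
    total = len(zlib.compress(" ".join(structure).encode()))
    for values in containers.values():
        total += len(zlib.compress("\n".join(values).encode()))
    return total

# Hypothetical mini log (no namespace prefixes, to keep the parser happy).
SAMPLE = "<log>" + "".join(
    f"<entry><requestLine>GET /page{i} HTTP/1.0</requestLine>"
    f"<statusCode>{code}</statusCode></entry>"
    for i, code in enumerate([200, 404, 200])
) + "</log>"

structure, containers = split_containers(SAMPLE)
print(structure[:4], containers["statusCode"])
print(compressed_size(structure, containers), "bytes in total")
```

On a three-entry sample the sizes mean little; the benefit of grouping similar values only shows on large, repetitive inputs such as the web log above.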

Apply semantic (specialized) compressors:
  gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82MB
Examples:
  8, 16, 32-bit integer encoding (signed/unsigned)
  Differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
  Compress lists, records (e.g. 4 bytes)
Need user input to select the semantic compressor
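As an illustration of a semantic compressor, the sketch below applies differential encoding to the year sequence from the slide, packs the differences as 16-bit signed integers, and hands the bytes to zlib. The function names and the choice of 16-bit packing are assumptions for this example, not XMill's built-in compressors.

```python
import struct
import zlib

def delta_encode(values):
    """Keep the first value, then store each value as its difference
    from the previous one."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def pack_int16(numbers):
    """Binary-encode small integers as little-endian signed 16-bit values."""
    return struct.pack(f"<{len(numbers)}h", *numbers)

years = [1999, 1995, 2001, 2000, 1995]
deltas = delta_encode(years)
payload = zlib.compress(pack_int16(deltas))   # semantic step first, generic step second
print(deltas)
```

The semantic step turns the text values into small numbers that the generic compressor can then squeeze further.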

Experiments

XML Compression

Compression Time

Transfer Time (& Decode)

XGRIND (Tolani & Haritsa, 2002)
Encodes elements and attributes using XMill's approach
DTD-conscious: enumerated attributes with k possible values are encoded using a log2 k-bit scheme
Data values are encoded using non-adaptive Huffman coding
  Requires two passes over the input document
  Separate statistical model for each element/attribute
Homomorphic compression: the compressed document retains the original structure
June 24, 2008, XML Compression Techniques
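Non-adaptive Huffman coding can be sketched as two passes: one to count symbol frequencies, one to encode every value with the resulting fixed table. The code below is a generic character-level Huffman coder written for this note, not XGrind's implementation, and the sample values are hypothetical.

```python
import heapq
from collections import Counter

def huffman_codes(freq):
    """Build a prefix code (symbol -> bit string) from a symbol -> count table."""
    if not freq:
        return {}
    if len(freq) == 1:
        return {sym: "0" for sym in freq}
    # Heap entries: (count, tie-breaker id, partial code table).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(value, codes):
    return "".join(codes[ch] for ch in value)

def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # prefix-free: first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)

values = ["78", "86", "91", "87"]            # hypothetical values of one element
codes = huffman_codes(Counter("".join(values)))   # pass 1: gather statistics
encoded = [encode(v, codes) for v in values]      # pass 2: encode with a fixed table
print(codes)
```

Because the table never changes during encoding, equal values always produce equal bit strings, which is what allows exact-match predicates to be checked without decompressing.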

XGRIND
Original fragment:
  <student name="Alice">
    <a1>78</a1>
    <a2>86</a2>
    <midterm>91</midterm>
    <project>87</project>
  </student>
Compressed fragment:
  T0 A0 nahuff(Alice)
  T1 nahuff(78) /
  T2 nahuff(86) /
  T3 nahuff(91) /
  T4 nahuff(87) /
  /

XGRIND
Many queries can be carried out entirely in the compressed domain
  Exact-match, prefix-match
Some others require decompression of only the relevant values
  Range, substring
Queryability comes at the expense of the achievable compression ratio: typically within 65-75% of that of XMill

ISX Requirements
Space does matter for many applications
Generally, reducing space improves cache locality
Indirection is expensive
Support fast navigation
Support fast insertion and deletion
Support efficient joins
Separate topology, text and schema

ISX Goal
To find a space-efficient storage scheme for XML data without compromising either query or update performance

Proposed Storage Structure
The ISX Structure

Sample DBLP XML Fragment

Balanced Parenthesis Encoding

Node Navigations

Topology Tiers
Each tier entry stores:
  No. of )
  No. of (
  No. of text nodes
  Min, max of forward excess
  Min, max of backward excess

Primitive operators

Topology Tiers
Each tier entry stores:
  No. of )
  No. of (
  No. of text nodes
  Min, max of forward excess
  Min, max of backward excess
Example: excess 2. Where is the close tag?
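The "where is the close tag?" question can be sketched on a balanced-parenthesis string: scan forward tracking the excess of ( over ), and use a per-block minimum excess to skip blocks that cannot contain the match. This is a simplified single-tier illustration written for this note; the block layout and function names are assumptions and do not reproduce the ISX tier format.

```python
def to_parens(tree):
    """Encode a nested (label, [children]) tree as balanced parentheses
    in document order; labels are stored elsewhere."""
    _label, children = tree
    return "(" + "".join(to_parens(child) for child in children) + ")"

def block_summaries(bp, block_size):
    """Per block: (net excess change, minimum running excess inside the block)."""
    summaries = []
    for start in range(0, len(bp), block_size):
        excess, low = 0, 0
        for ch in bp[start:start + block_size]:
            excess += 1 if ch == "(" else -1
            low = min(low, excess)
        summaries.append((excess, low))
    return summaries

def find_close(bp, open_pos, block_size, summaries):
    """Position of the ')' matching the '(' at open_pos."""
    excess = 0
    pos = open_pos
    while pos < len(bp):
        if pos % block_size == 0 and pos > open_pos:
            delta, low = summaries[pos // block_size]
            if excess + low > 0:        # excess cannot drop to 0 inside this block
                excess += delta
                pos += block_size
                continue
        excess += 1 if bp[pos] == "(" else -1
        if excess == 0:
            return pos
        pos += 1
    raise ValueError("unbalanced parentheses")

# Hypothetical tree: a(b(c, d), b(c))
TREE = ("a", [("b", [("c", []), ("d", [])]), ("b", [("c", [])])])
bp = to_parens(TREE)
summaries = block_summaries(bp, 4)
print(bp, find_close(bp, 0, 4, summaries), find_close(bp, 1, 4, summaries))
```

With larger documents the same idea is applied over several tiers, so that long stretches of the topology are skipped using only their summaries.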

Tier 2 excess

Efficient Updates

Example
100 MB DBLP document
5 million XML nodes
ISX: 1MB topology

Another example
Machine: Core Duo 1.83GHz, 1GB RAM, 5400 RPM hard drive, MS Vista

100M DBLP (MSXML vs ISX)
  Runtime (loading): 329MB vs 67MB
  Loading time: 17.8s vs 0.67s
  Runtime (//www): 333MB
  //www: 1.814s vs 0.143s

5M DBLP (MSXML vs ISX)
  Runtime (loading): 15MB vs 4MB
  Loading time: 0.54s vs 0.035s
  Runtime (//www): 21MB
  //www: 0.096s vs 0.004s

ISX Features

Experiments Setup
Fixed at 64MB memory buffer
Up to 16 GB XML document
  E.g. 16 GB DBLP contains > 770 million nodes
NO index or query optimization has been employed for ISX (except for ISX Stream, where the TurboXPath algorithm has been employed)

Storage Size (ISX vs NoK)

Storage Size (ISX, XMill, XGrind): DBLP

Storage Size (ISX, XMill): TreeBank

Bulk Loading Performance

Queries

Q1: //inproceedings

Q5: //article[.//month/text() = “July”]//title

Other queries

XPath 13 axes
We can navigate along 13 axes:
  ancestor
  ancestor-or-self
  attribute
  child
  descendant
  descendant-or-self
  following
  following-sibling
  namespace
  parent
  preceding
  preceding-sibling
  self

Node Navigation

Full document traversal

Update (Insertion) Performance

ISX Summary
Small storage footprint
Small runtime footprint
Fast and consistent performance on navigational access
Superior query performance (further indexing / query optimization can be added)
Superior update performance

Compressing and Searching XML Data Via Two Zips
Paolo Ferragina et al. Slides modified from P. Ferragina’s

An XML excerpt
It is verbose!
  ...
  <dblp>
    <book>
      <author> Donald E. Knuth </author>
      <title> The TeXbook </title>
      <publisher> Addison-Wesley </publisher>
      <year> 1986 </year>
    </book>
    <article>
      <author> Ronald W. Moore </author>
      <title> An Analysis of Alpha-Beta Pruning </title>
      <pages> </pages>
      <year> 1975 </year>
      <volume> 6 </volume>
      <journal> Artificial Intelligence </journal>
    </article>
    ...
  </dblp>

A tree interpretation...
XML document exploration ≡ Tree navigation
XML document search ≡ Labeled subpath searches
Subset of XPath [W3C]

The Problem
We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
  Navigational operations: parent(u), child(u, i), child(u, i, c)
  Subpath searches: given a sequence P of k labels
  Content searches: subpath + substring search
  Visualization operation: given a node, visualize its descending subtree
XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage.
Existing approaches:
  XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...): need the whole decompression
  XML-queriable compressors (like XPress, XGrind, XQzip, ...): poor compression and scan of the whole (compressed) file
  Summary indexes (like Dataguide, 1-index or 2-index): large space and do not support "content" searches
  Theoretically, many solutions exist, starting from [Jacobson, IEEE FOCS '89]: no subpath/content searches, and poor performance on labeled trees

A transform for "labeled trees" [Ferragina et al, IEEE FOCS '05]
We proposed the XBW-transform, which mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings.
The XBW linearizes the tree T in 2 arrays such that:
  the compression of T reduces to using any compressor (gzip, bzip, ...) over these two arrays
  the indexing of T reduces to implementing simple rank/select query operations over these two arrays

The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path.
  Sa: permutation of tree nodes
  Sp: upward labeled paths
[Figure: example labeled tree rooted at C, with the Sa and Sp columns in pre-order]

The XBW-Transform
Step 2. Stably sort according to Sp (the upward labeled paths).
[Figure: the same example with the Sa and Sp columns after sorting]

The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children.
Key fact: nodes correspond to items in <Slast, Sa>; these two arrays form the XBW.
[Figure: the same example with columns Slast, Sa and Sp]
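The three steps can be sketched directly: a pre-order visit records (upward path, last-child flag, label) for every node, and a stable sort on the upward path yields the three columns. The tree below is a small hypothetical example with single-character labels, not the tree from the slide, and the function name xbw is chosen for this note.

```python
def xbw(tree):
    """Return (Slast, Sa, Sp) for a nested (label, [children]) tree.
    Sp holds each node's upward path: parent label first, root label last."""
    rows = []

    def visit(node, upward, is_last):
        label, children = node
        rows.append((upward, int(is_last), label))         # Step 1: pre-order visit
        for k, child in enumerate(children):
            visit(child, label + upward, k == len(children) - 1)

    visit(tree, "", True)                 # the root is treated as a last child
    rows.sort(key=lambda row: row[0])     # Step 2: list.sort is stable
    s_last = [row[1] for row in rows]     # Step 3: last-child marks
    s_a = [row[2] for row in rows]
    s_p = [row[0] for row in rows]
    return s_last, s_a, s_p

# Hypothetical tree: C(B(D(a), c), A(b), B(D(c)))
TREE = ("C", [
    ("B", [("D", [("a", [])]), ("c", [])]),
    ("A", [("b", [])]),
    ("B", [("D", [("c", [])])]),
])

s_last, s_a, s_p = xbw(TREE)
for row in zip(s_last, s_a, s_p):
    print(*row)
```

Only Slast and Sa need to be stored; Sp is shown to make the sorting order visible.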

XBzip: a simple XML compressor
XBW is compressible:
  Sa and Spcdata are locally homogeneous
  Slast has some structure
[Figure: XBW output split into two parts, "Tags, Attributes and symbol =" and "Pcdata"]

Some structural properties
Two useful properties:
  Children are contiguous and delimited by 1s
  Children reflect the order of their parents
[Figure: the example tree and its XBW columns Slast, Sa, Sp, with the children groups highlighted]

XBW is navigational
XBW is navigational: Rank-Select data structures on Slast and Sa, plus the array C of |Σ| integers
  Array C in the example: A 2, B 5, C 9, D 12
Get_children example:
  Rank(B, Sa) = 2
  Select in Slast the 2nd item 1, counting from the first row whose Sp starts with B
[Figure: the example tree and its XBW columns Slast, Sa, Sp, with the children of the selected B node highlighted]
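Get_children can be sketched with naive rank and select over the XBW columns. The code below rebuilds the columns for a small hypothetical tree, then locates a node's children from its label, its rank among equal labels in Sa, and the 1s in Slast. It assumes leaf labels never coincide with internal labels, uses linear scans in place of real rank/select structures, and scans Sp where an implementation would use the array C.

```python
def xbw(tree):
    """Return (Slast, Sa, Sp) for a nested (label, [children]) tree."""
    rows = []

    def visit(node, upward, is_last):
        label, children = node
        rows.append((upward, int(is_last), label))
        for k, child in enumerate(children):
            visit(child, label + upward, k == len(children) - 1)

    visit(tree, "", True)
    rows.sort(key=lambda row: row[0])          # stable sort on the upward path
    return [r[1] for r in rows], [r[2] for r in rows], [r[0] for r in rows]

def select1(bits, k):
    """Index of the k-th 1 (k >= 1) in bits."""
    seen = 0
    for pos, bit in enumerate(bits):
        seen += bit
        if bit and seen == k:
            return pos
    raise ValueError("fewer than k ones")

def children_of(i, s_last, s_a, s_p):
    """Row indices of the children of row i (empty list for a leaf)."""
    label = s_a[i]
    # First row whose upward path starts with this label (role of array C).
    block = [k for k, path in enumerate(s_p) if path.startswith(label)]
    if not block:
        return []
    first_row = block[0]
    r = s_a[:i + 1].count(label)               # Rank(label, Sa) up to row i
    ones_before = sum(s_last[:first_row])      # 1s before the label's block
    first = select1(s_last, ones_before + r - 1) + 1
    last = select1(s_last, ones_before + r)    # r-th 1 inside the block
    return list(range(first, last + 1))

# Hypothetical tree: C(B(D(a), c), A(b), B(D(c)))
TREE = ("C", [
    ("B", [("D", [("a", [])]), ("c", [])]),
    ("A", [("b", [])]),
    ("B", [("D", [("c", [])])]),
])
s_last, s_a, s_p = xbw(TREE)
second_b = [k for k, label in enumerate(s_a) if label == "B"][1]
print([s_a[k] for k in children_of(second_b, s_last, s_a, s_p)])
```

The two structural properties do the work here: children of equally labeled nodes sit in one block, in the same order as their parents, with each group closed by a 1.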

XBW is searchable (count subpaths)
XBW-index: Rank-Select data structures on Slast and Sa, plus the array C of |Σ| integers
Example: P = B D
  [fr, lr]: the rows whose Sp starts with 'B'
  Inductive step: pick the next char P[i+1], i.e. 'D'
  Search for the first and last 'D' in Sa[fr, lr]
  Jump to their children: their children have upward path = 'D B'
  Result: 2 occurrences of P, because of the two 1s
[Figure: the example tree and its XBW columns Slast, Sa, Sp, with the ranges fr and lr marked]
