1
Lecture 9: XML Compression
2
Semistructured Data / XML
Loosely structured (no restrictions on tags and nesting relationships)
No schema required
XML falls under the “semistructured” umbrella
Self-describing
The standard for information representation and exchange
3
XML data file can be modeled in a tree form
<Staff>
  <Name>
    <FirstName> Raymond </FirstName>
    <LastName> Wong </LastName>
  </Name>
  <Login> wong </Login>
  <Ext> 5932 </Ext>
</Staff>
Tree form:
Staff
  Name
    FirstName: “Raymond”
    LastName: “Wong”
  Login: “wong”
  Ext: “5932”
4
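XPath evaluation
Document: <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>
Query: /a/b[c = "12"]
Tree: a has two b children; the first b has children c (12) and d (7), the second b has a single child c (7).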
5
Query evaluation Top-down Bottom-up Hybrid
7
XPath evaluation
Document: <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>
Query: /a/b[c = "12"]
Result: <b><c>12</c><d>7</d></b>
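The query on this slide can be reproduced with Python's standard `xml.etree.ElementTree`, whose limited XPath subset supports the `[tag='text']` predicate. This is a minimal sketch for checking the expected result; it says nothing about how a native XML engine evaluates the path, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = "<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>"
root = ET.fromstring(doc)  # root is the <a> element

# /a/b[c = "12"]: relative to <a>, select each child <b> that has
# a child <c> whose text content is exactly "12".
matches = root.findall("b[c='12']")

result = [ET.tostring(m, encoding="unicode") for m in matches]
print(result)
```

Only the first `b` element qualifies, because the second one has `c` equal to 7.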
8
Path indexing
For semistructured / XML data, query processing is almost the same as traversing the graph/tree.
Normally, evaluation has to traverse the data from the root and return all nodes X reachable by a path matching the given regular path expression.
Motivation: an index allows the system to answer regular path expressions without traversing the whole graph/tree.
9
Major Criteria for indexing
Speeds up the search (by cutting the search space down)
Relatively smaller than the original data graph/tree
Easy to maintain (during data loading and during updates)
10
An Example of DAG Data
[Figure: a DAG with nodes root and o1 to o13; edges are labeled member, dept, support, staff, name, and phone.]
11
Index graph based on language-equivalence
A reduced graph that summarizes all paths from the root in the data graph
The paths from root to o12: staff, dept/member, support/member
12
Language-equivalent nodes
Let L(x) := {w | there is a path from the root to x labeled w}
The set L(x) may be infinite when there are cycles
Nodes x, y are language-equivalent (x ≡ y) if L(x) = L(y)
We construct the index I by taking its nodes to be the equivalence classes of ≡
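The definition can be tried out on a small graph. The sketch below enumerates L(x) for every node and groups nodes with identical path sets. The graph, its node names, and the function names are hypothetical, and the brute-force enumeration only terminates on acyclic data, so it illustrates the definition rather than a practical index construction.

```python
from collections import defaultdict

# Hypothetical acyclic data graph: node -> list of (edge label, child).
GRAPH = {
    "root": [("dept", "d1"), ("staff", "m1"), ("staff", "m2")],
    "d1": [("member", "m1"), ("member", "m2")],
    "m1": [("name", "n1")],
    "m2": [("name", "n2"), ("phone", "p1")],
    "n1": [],
    "n2": [],
    "p1": [],
}


def label_paths(graph, root="root"):
    """Return L(x) for every node x: the set of label sequences
    spelled by paths from the root to x (acyclic graphs only)."""
    paths = defaultdict(set)
    stack = [(root, ())]
    while stack:
        node, word = stack.pop()
        paths[node].add(word)
        for label, child in graph[node]:
            stack.append((child, word + (label,)))
    return paths


def equivalence_classes(graph, root="root"):
    """Group nodes whose path languages L(x) are identical."""
    groups = defaultdict(list)
    for node, words in label_paths(graph, root).items():
        groups[frozenset(words)].append(node)
    return sorted(sorted(group) for group in groups.values())


classes = equivalence_classes(GRAPH)
print(classes)
```

Here `m1` and `m2` end up in the same class because both are reached by exactly `staff` and `dept/member`, even though their outgoing edges differ.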
13
Language-equivalent
The paths from root to o3: staff, dept/member
Paths to o4 happen to be exactly the same 2 sequences
Same for o8 and o12
Equivalence class: o3, o4, o8, o12
14
Equivalence classes
{o3, o4, o8, o12}, {o1, o2, o7}, {o12, o13}, {o5, o6, o9}, {o10}, {o11}
[Figure: the data graph (nodes root and o1 to o13; edge labels member, dept, support, staff, name, phone) with the equivalence classes marked.]
15
The index graph
[Figure: index graph with nodes root, {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}; edge labels member, support, staff, dept, name, phone.]
16
Query processing based on the index graph
[Figure: index graph with nodes root, {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}, {o10}, {o11}; edge labels member, support, staff, dept, name, phone.]
dept/member/(name | phone)
-> dept/member/name UNION dept/member/phone
-> {o5, o6, o9} UNION {o10}
-> {o5, o6, o9, o10}
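The evaluation above can be sketched as a walk over the index graph followed by a union of extents. The extents and the query are taken from the slide, but the edge layout in `EDGES`, the node ids `n1` to `n4`, and the function name are assumptions made for illustration (the support branch is left out), since the figure itself is not recoverable from the text.

```python
# Hypothetical index graph: index node -> list of (label, index node).
# The edges are an assumption; only the extents come from the slide.
EDGES = {
    "root": [("dept", "n1"), ("staff", "n2")],
    "n1": [("member", "n2")],
    "n2": [("name", "n3"), ("phone", "n4")],
    "n3": [],
    "n4": [],
}

# Extent of each index node: the data nodes it stands for.
EXTENT = {
    "root": {"root"},
    "n1": {"o1", "o2", "o7"},
    "n2": {"o3", "o4", "o8", "o12"},
    "n3": {"o5", "o6", "o9"},
    "n4": {"o10"},
}


def evaluate(path, start="root"):
    """Evaluate a path on the index graph.

    `path` is a list of steps; each step is a set of alternative labels,
    so (name | phone) becomes {"name", "phone"}.
    """
    frontier = {start}
    for step in path:
        frontier = {
            child
            for node in frontier
            for label, child in EDGES[node]
            if label in step
        }
    result = set()
    for node in frontier:
        result |= EXTENT[node]  # union of the extents reached
    return result


answer = evaluate([{"dept"}, {"member"}, {"name", "phone"}])
print(sorted(answer))
```

The data graph is never touched: the query visits a handful of index nodes and returns the stored extents.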
17
About this indexing scheme
The index graph is never larger than the data graph
In practice, the index graph is small enough to fit in memory
Constructing the index is, however, a problem: checking whether two nodes are language-equivalent is very expensive (PSPACE-complete)
An approximation based on bisimulation exists
18
A Data Guide
[Figure: a data guide rooted at root with edges labeled dept, support, staff, member, phone, name; node extents include {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}, {o10}, {o11}, and some extents appear under more than one node.]
19
About Data Guide
Unique labels at each node
(Hence) extents are no longer disjoint
Query processing proceeds as before
The size of the index may be >= the data size
Good for data that is regular and has no cycles
20
XML-Specific Compressors
Unqueryable compression (e.g. XMill):
  Full-chunked: data commonalities eliminated
  Very good compression ratio
Queryable compression (e.g. XGrind, XPRESS):
  Fine-grained: data commonalities ignored
  Inadequate compression ratio and time
  Supports simple path queries with atomic predicates
21
XMill
First specialized compressor for XML data
SAX parser for parsing XML data
Still uses gzip as its underlying compressor
Clever grouping of data into containers for compression
Compresses XML via three basic techniques:
  Compress the structure separately from the data
  Group the data values according to their types
  Apply semantic (specialized) compressors
Downloadable:
22
XMill Architecture:
23
An Example: Web Server Logs
ASCII file, 15.9 MB (gzipped 1.6 MB):
|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|
XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1 MB):
<apache:entry>
  <apache:host> </apache:host>
  <apache:requestLine> GET / HTTP/1.0 </apache:requestLine>
  <apache:contentType> text/html </apache:contentType>
  <apache:statusCode> 200</apache:statusCode>
  <apache:date> 1997/10/01-00:00:02</apache:date>
  <apache:byteCount> 4478</apache:byteCount>
  <apache:referer> </apache:referer>
  <apache:userAgent> Mozilla/3.1$[$ja$]$(I)</apache:userAgent>
</apache:entry>
24
How XMill Works: Three Ideas
Compress the structure separately from the data:
Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
Data: GET / HTTP/1.0, text/html, 200, …
gzip(Structure) + gzip(Data) = 1.75MB
25
How XMill Works: Three Ideas
Group the data values according to their types:
Structure: <apache:entry> . . . </apache:entry>
Data containers (one per type), e.g. GET / HTTP/1.0, GET / HTTP/1.1, …
gzip(Structure) + gzip(Data1) + gzip(Data2) + ... = 1.33MB
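The first two ideas can be sketched in a few lines: walk the document, emit a structure stream, and append each text value to a container keyed by its tag. This is an illustration under stated assumptions, not XMill's actual container format: `split_into_containers`, the `#` placeholder token, the un-prefixed tag names, and the two-entry sample log are all hypothetical, and `zlib` stands in for gzip. Text that follows a child element (mixed content) is ignored.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_into_containers(xml_text):
    """Return (structure tokens, {tag: [text values]})."""
    structure = []
    containers = defaultdict(list)

    def visit(elem):
        structure.append("<" + elem.tag + ">")
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)  # value goes to its container
            structure.append("#")              # placeholder left in the structure
        for child in elem:
            visit(child)
        structure.append("/")                  # end-of-element token

    visit(ET.fromstring(xml_text))
    return structure, dict(containers)


# Hypothetical two-entry log; tag names mirror the slide without the prefix.
SAMPLE = (
    "<log>"
    "<entry><requestLine>GET / HTTP/1.0</requestLine>"
    "<statusCode>200</statusCode></entry>"
    "<entry><requestLine>GET / HTTP/1.1</requestLine>"
    "<statusCode>404</statusCode></entry>"
    "</log>"
)

structure, containers = split_into_containers(SAMPLE)

# One stream for the structure, one per container, each compressed on its own.
streams = {"structure": " ".join(structure)}
streams.update(
    {"data:" + tag: "\n".join(values) for tag, values in containers.items()}
)
compressed = {
    name: zlib.compress(text.encode("utf-8")) for name, text in streams.items()
}

for name, blob in compressed.items():
    print(name, len(streams[name]), "->", len(blob), "bytes")
```

On an input this small the compressed streams are not smaller than the originals; the benefit reported on the slides comes from long runs of similar values ending up next to each other in the same container.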
26
How XMill Works: Three Ideas
Apply semantic (specialized) compressors:
gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82MB
Examples:
  8, 16, 32-bit integer encoding (signed/unsigned)
  differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
  compress lists, records (e.g. 4 bytes)
Needs user input to select the semantic compressor
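As an illustration of what a semantic compressor might do, the sketch below applies differential encoding to the year sequence from the slide and then packs the result into fixed-width binary integers. The function names and the byte layout (one unsigned 16-bit value followed by signed 8-bit deltas) are assumptions for this example, not XMill's on-disk format, and the layout only works when every delta fits in a signed byte.

```python
import struct


def delta_encode(values):
    """Replace each value by its difference from the previous one.
    The first entry is kept as-is (difference from 0)."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    total = 0
    values = []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


years = [1999, 1995, 2001, 2000, 1995]
deltas = delta_encode(years)

# Assumed layout: first value as unsigned 16-bit, the rest as signed 8-bit.
packed = struct.pack("<H", deltas[0]) + struct.pack(
    "<%db" % (len(deltas) - 1), *deltas[1:]
)

print(deltas, len(packed), "bytes")
```

The five years take 6 bytes in this layout, compared with 20 characters as text; the packed bytes would then be handed to gzip like any other container.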
27
Experiments
28
XML Compression
29
Compression Time
30
Transfer Time (& Decode)
31
XGRIND (Tolani & Haritsa, 2002)
Encodes elements and attributes using XMill’s approach
DTD-conscious: enumerated attributes with k possible values are encoded using a log2 k-bit scheme
Data values are encoded using non-adaptive Huffman coding
  Requires two passes over the input document
  Separate statistical model for each element/attribute
Homomorphic compression: compressed document retains original structure
June 24, 2008 XML Compression Techniques 31
32
XML Compression Techniques
XGRIND
Original Fragment:
<student name="Alice">
  <a1>78</a1>
  <a2>86</a2>
  <midterm>91</midterm>
  <project>87</project>
</student>
Compressed Fragment:
T0 A0 nahuff(Alice) T1 nahuff(78) / T2 nahuff(86) / T3 nahuff(91) / T4 nahuff(87) / /
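The token stream on this slide can be reproduced with a short recursive walk. This is a sketch of the homomorphic layout only: it assumes tag and attribute codes are assigned in order of first appearance, and it leaves `nahuff(...)` as a symbolic placeholder instead of computing real non-adaptive Huffman codes. The function name `xgrind_like_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET


def xgrind_like_tokens(xml_text):
    """Emit a token string that keeps the document's nesting:
    Tn = element code, An = attribute code, / = end of element."""
    tag_ids, attr_ids = {}, {}
    tokens = []

    def code(table, prefix, name):
        # Assumption: codes are handed out in order of first appearance.
        if name not in table:
            table[name] = prefix + str(len(table))
        return table[name]

    def visit(elem):
        tokens.append(code(tag_ids, "T", elem.tag))
        for name, value in elem.attrib.items():
            tokens.append(code(attr_ids, "A", name))
            tokens.append("nahuff(%s)" % value)   # placeholder, not a real code
        text = (elem.text or "").strip()
        if text:
            tokens.append("nahuff(%s)" % text)
        for child in elem:
            visit(child)
        tokens.append("/")

    visit(ET.fromstring(xml_text))
    return " ".join(tokens)


fragment = (
    '<student name="Alice"><a1>78</a1><a2>86</a2>'
    "<midterm>91</midterm><project>87</project></student>"
)
encoded = xgrind_like_tokens(fragment)
print(encoded)
```

Because every element still opens with its code and closes with `/`, a path query can be matched against the codes without expanding the values.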
33
XML Compression Techniques
XGRIND
Many queries can be carried out entirely in the compressed domain
  Exact-match, prefix-match
Some others require only decompression of the relevant values
  Range, substring
Queryability comes at the expense of achievable compression ratio: typically within 65-75% of that of XMill
34
ISX Requirements
Space does matter for many applications
  Generally, reducing space improves cache locality
  Indirection is expensive
Support fast navigations
Support fast insertion and deletion
Support efficient joins
Separate topology, text and schema
35
ISX Goal
To find a space-efficient storage scheme for XML data without compromising either query or update performance
36
Proposed Storage Structure
The ISX Structure
37
Sample DBLP XML Fragment
38
Balanced Parenthesis Encoding
39
Node Navigations
40
Topology Tiers
No. of )
No. of (
No. of text nodes
Min, max of forward excess
Min, max of backward excess
41
Primitive operators
42
Topology Tiers
No. of )
No. of (
No. of text nodes
Min, max of forward excess
Min, max of backward excess
Example: excess 2. Where is the close tag?
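The question "where is the close tag?" can be answered by tracking excess (open minus close parentheses) and using per-block summaries to skip blocks that cannot contain the match. The sketch below uses a single tier holding only the net change and the minimum forward excess per block; the tree representation, function names, and block size are assumptions, and ISX's real tiers store more fields than this.

```python
def encode(tree):
    """tree = (label, [children]); return its balanced-parenthesis string."""
    _label, children = tree
    return "(" + "".join(encode(child) for child in children) + ")"


def build_blocks(bp, block_size):
    """Per block: (net excess change, minimum running excess in the block)."""
    blocks = []
    for start in range(0, len(bp), block_size):
        excess, lowest = 0, 0
        for ch in bp[start:start + block_size]:
            excess += 1 if ch == "(" else -1
            lowest = min(lowest, excess)
        blocks.append((excess, lowest))
    return blocks


def find_close(bp, blocks, block_size, i):
    """Index of the ')' matching the '(' at position i."""
    assert bp[i] == "("
    excess = 0
    # 1. Scan the rest of the block that contains i.
    end = min(len(bp), (i // block_size + 1) * block_size)
    for pos in range(i, end):
        excess += 1 if bp[pos] == "(" else -1
        if excess == 0:
            return pos
    # 2. Skip whole blocks in which the excess cannot drop to zero.
    b = i // block_size + 1
    while b < len(blocks):
        net, lowest = blocks[b]
        if excess + lowest <= 0:
            break
        excess += net
        b += 1
    # 3. Scan inside the block that must contain the match.
    for pos in range(b * block_size, len(bp)):
        excess += 1 if bp[pos] == "(" else -1
        if excess == 0:
            return pos
    raise ValueError("unbalanced parenthesis sequence")


# Shape of <a><b><c/><d/></b><b><c/></b></a>; labels are kept elsewhere.
tree = ("a", [("b", [("c", []), ("d", [])]), ("b", [("c", [])])])
bp = encode(tree)
blocks = build_blocks(bp, 4)
print(bp, find_close(bp, blocks, 4, 0), find_close(bp, blocks, 4, 1))
```

With summaries like these, a subtree that spans many blocks can be skipped by reading one summary per block instead of every parenthesis.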
43
Tier 2 excess
44
Efficient Updates
45
Example
100 MB DBLP document
5 million XML nodes
ISX: 1MB topology
46
Another example
Test machine: Core Duo 1.83GHz, 1GB RAM, 5400 RPM hard drive, MS Vista

100M DBLP            MSXML     ISX
Runtime (loading)    329MB     67MB
Loading time         17.8s     0.67s
Runtime (//www)      333MB     (not given)
//www                1.814s    0.143s

5M DBLP              MSXML     ISX
Runtime (loading)    15MB      4MB
Loading time         0.54s     0.035s
Runtime (//www)      21MB      (not given)
//www                0.096s    0.004s
47
ISX Features
48
Experiments Setup
Fixed at 64MB memory buffer
Up to 16 GB XML document
  E.g. 16 GB DBLP contains > 770 million nodes
NO index or query optimization has been employed for ISX (except for ISX Stream, where the TurboXPath algorithm has been employed)
49
Storage Size (ISX vs NoK)
50
Storage Size (ISX, XMill, XGrind): DBLP
51
Storage Size (ISX, XMill): TreeBank
52
Bulk Loading Performance
53
Queries
54
Q1: //inproceedings
55
Q5: //article[.//month/text() = “July”]//title
56
Other queries
57
XPath 13 axes
We can navigate along 13 axes: ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self
58
Node Navigation
59
Full document traversal
60
Update (Insertion) Performance
61
ISX Summary
Small storage footprint
Small runtime footprint
Fast and consistent performance on navigational access
Superior query performance (further indexing / query optimization can be added)
Superior update performance
62
Compressing and Searching XML Data Via Two Zips
Paolo Ferragina et al. Slides modified from P. Ferragina’s
63
An XML excerpt
It is verbose!
...
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
64
A tree interpretation...
XML document exploration -> Tree navigation
XML document search -> Labeled subpath searches (subset of XPath [W3C])
65
The Problem
We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
  Navigational operations: parent(u), child(u, i), child(u, i, c)
  Subpath searches: given a sequence P of k labels
  Content searches: subpath + substring search
  Visualization operation: given a node, visualize its descending subtree
Existing approaches:
  XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...): need the whole decompression
  XML-queryable compressors (like XPress, XGrind, XQzip, ...): poor compression and scan of the whole (compressed) file
  Summary indexes (like Dataguide, 1-index or 2-index): large space and do not support “content” searches
  Theoretically, many solutions do exist, starting from [Jacobson, IEEE FOCS ’89]: no subpath/content searches, and poor performance on labeled trees
XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage
66
A transform for “labeled trees” [Ferragina et al, IEEE FOCS ’05]
We proposed the XBW-transform, which mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings
The XBW linearizes the tree T in 2 arrays such that:
  the compression of T reduces to using any compressor (gzip, bzip, ...) over these two arrays
  the indexing of T reduces to implementing simple rank/select query operations over these two arrays
67
The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path
Sa: permutation of tree nodes
Sp: upward labeled paths
[Figure: example labeled tree with the arrays Sa and Sp listed in pre-order.]
68
The XBW-Transform
Step 2. Stably sort according to Sp
Sp: upward labeled paths
[Figure: the arrays Sa and Sp after the stable sort.]
69
The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children
Key fact: nodes correspond to items in <Slast, Sa>
[Figure: the arrays Slast, Sa and Sp; the pair <Slast, Sa> is the XBW of the tree.]
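The three steps can be written down directly. The sketch below is a naive construction that materializes every upward path, which is fine for illustration but not how an efficient implementation would build the transform. The example tree is hypothetical (it is not necessarily the one in the figure), the function name is an assumption, and marking the root as a last child is a choice made here.

```python
def xbw(tree):
    """tree = (label, [children]). Return (s_last, s_a, s_p) as lists."""
    rows = []

    def visit(node, upward, is_last):
        label, children = node
        # Step 1: pre-order visit, recording label and upward path.
        rows.append((upward, is_last, label))
        for k, child in enumerate(children):
            # The child's upward path starts with this node's label.
            visit(child, (label,) + upward, k == len(children) - 1)

    visit(tree, (), True)  # assumption: the root counts as a last child

    # Step 2: stable sort on the upward path (Python's sort is stable,
    # so rows with equal paths keep their pre-order).
    rows.sort(key=lambda row: row[0])

    # Step 3: Slast marks the rows that are last children.
    s_last = [int(row[1]) for row in rows]
    s_a = [row[2] for row in rows]
    s_p = [row[0] for row in rows]
    return s_last, s_a, s_p


# Hypothetical labeled tree: uppercase = internal nodes, lowercase = leaves.
tree = (
    "C",
    [
        ("B", [("D", [("a", [])]), ("c", [])]),
        ("A", [("b", [])]),
        ("B", [("D", [("c", [])])]),
    ],
)

s_last, s_a, s_p = xbw(tree)
for last, label, path in zip(s_last, s_a, s_p):
    print(last, label, "".join(path))
```

In the output, the children of each node sit in consecutive rows and each group ends with a 1 in `s_last`, which is what later makes rank/select navigation possible.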
70
XBzip – a simple XML compressor
XBW is compressible:
  Sa and Spcdata are locally homogeneous
  Slast has some structure
[Figure: the XBW output split into tags, attributes and symbol “=” on one side and Pcdata on the other.]
71
Some structural properties
Two useful properties:
  Children are contiguous and delimited by 1s
  Children reflect the order of their parents
[Figure: example tree and its XBW arrays Slast, Sa, Sp, with the children of each node highlighted.]
72
XBW is navigational
Get_children example: Rank(B, Sa) = 2, so select in Slast the 2nd item 1 from the position given by C for B
XBW is navigational:
  Rank-Select data structures on Slast and Sa
  The array C of |Σ| integers (in the example: A 2, B 5, C 9, D 12)
[Figure: example tree and its XBW arrays Slast, Sa, Sp, with the Get_children steps marked.]
73
XBW is searchable (count subpaths)
Example: P = B D
Rows [fr, lr]: rows whose Sp starts with ‘B’
Inductive step: pick the next char in P[i+1], i.e. ‘D’; search for the first and last ‘D’ in Sa[fr, lr]; jump to their children
Their children have upward path = ‘D B’
2 occurrences of P because of two 1s
XBW is searchable:
  Rank-Select data structures on Slast and Sa
  Array C of |Σ| integers
[Figure: example tree and its XBW-index arrays Slast, Sa, Sp, with the range [fr, lr] marked.]