Lecture 9: XML Compression


1 Lecture 9: XML Compression

2 Semistructured Data / XML
loosely structured (no restrictions on tags & nesting relationships)
no schema required
XML under the “semistructured” umbrella
self-describing
the standard for information representation & exchange

3 XML data file can be modeled in a tree form
<Staff>
  <Name>
    <FirstName> Raymond </FirstName>
    <LastName> Wong </LastName>
  </Name>
  <Login> wong </Login>
  <Ext> 5932 </Ext>
</Staff>
Tree: Staff has children Name, Login (“wong”) and Ext (“5932”); Name has children FirstName (“Raymond”) and LastName (“Wong”)
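A minimal sketch of the tree view of this document, using Python's standard `xml.etree.ElementTree`. The helper name `to_tree` and the nested-tuple representation are assumptions made for illustration, not part of the lecture.

```python
import xml.etree.ElementTree as ET

XML = """<Staff>
  <Name><FirstName> Raymond </FirstName><LastName> Wong </LastName></Name>
  <Login> wong </Login>
  <Ext> 5932 </Ext>
</Staff>"""

def to_tree(elem):
    """Return (tag, [children]) for inner nodes and (tag, text) for leaves."""
    children = [to_tree(child) for child in elem]
    if children:
        return (elem.tag, children)
    return (elem.tag, (elem.text or "").strip())

tree = to_tree(ET.fromstring(XML))
print(tree)
```

Element names become inner nodes and text values become leaves, which is the shape the later indexing and compression slides work on.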

4 XPath evaluation
Document: <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>
Query: /a/b[c = “12”]
Tree: a has two b children; the first b has children c (12) and d (7), the second b has child c (7)

5 Query evaluation
Top-down
Bottom-up
Hybrid

7 XPath evaluation
Document: <a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>
Query: /a/b[c = “12”]
Result: <b><c>12</c><d>7</d></b>
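The evaluation above can be sketched as a top-down walk from the root. The function name `eval_top_down` is an assumption; the cross-check uses the limited XPath subset that `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b><c>12</c><d>7</d></b><b><c>7</c></b></a>")

def eval_top_down(root):
    """Evaluate /a/b[c = "12"] by walking down from the root."""
    if root.tag != "a":
        return []
    return [b for b in root
            if b.tag == "b"
            and any(c.tag == "c" and c.text == "12" for c in b)]

matches = eval_top_down(doc)
# Same query through ElementTree's built-in path support (relative to <a>).
builtin = doc.findall("./b[c='12']")
print([ET.tostring(b, encoding="unicode") for b in matches])
```

Only the first b element qualifies, because the second b has a c child with value 7.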

8 Path indexing
Traversing the graph/tree is almost the same as query processing for semistructured / XML data
Normally, answering a query requires traversing the data from the root and returning all nodes X reachable by a path matching the given regular path expression
Motivation: an index allows the system to answer regular path expressions without traversing the whole graph/tree

9 Major Criteria for indexing
Speed up the search (by cutting the search space down)
Relatively smaller size than the original data graph/tree
Easy to maintain (during data loading and during updates)

10 An Example of DAG Data
[Figure: a DAG with nodes root and o1 to o13; edge labels: member, dept, support, staff, name, phone]

11 Index graph based on language-equivalence
A reduced graph that summarizes all paths from the root in the data graph
The paths from root to o12: staff, dept/member, support/member

12 Language-equivalent nodes
Let L(x) := {w | ∃ a path from the root to x labeled w}
The set L(x) may be infinite when there are cycles
Nodes x, y are language-equivalent (x ≡ y) if L(x) = L(y)
We construct index I by taking the nodes to be the equivalence classes for ≡
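A minimal sketch of the definition, assuming an acyclic graph so that every L(x) is finite. The graph `EDGES` is a small hypothetical DAG, not the one in the lecture's figure, and enumerating all paths this way is exponential in general.

```python
from collections import defaultdict

# Hypothetical DAG: node -> list of (edge label, target node).
EDGES = {
    "root": [("dept", "o1"), ("dept", "o2"), ("staff", "o3"), ("staff", "o4")],
    "o1": [("member", "o3")],
    "o2": [("member", "o4")],
    "o3": [("name", "o5")],
    "o4": [("name", "o6"), ("phone", "o7")],
}

def path_languages(edges, root="root"):
    """Map each node x to L(x), the set of label paths from the root to x."""
    lang = defaultdict(set)
    stack = [(root, ())]
    while stack:                      # terminates only because the graph is acyclic
        node, path = stack.pop()
        lang[node].add("/".join(path))
        for label, target in edges.get(node, []):
            stack.append((target, path + (label,)))
    return lang

def equivalence_classes(edges):
    """Group nodes whose path languages are identical."""
    groups = defaultdict(list)
    for node, words in path_languages(edges).items():
        groups[frozenset(words)].append(node)
    return sorted(sorted(group) for group in groups.values())

classes = equivalence_classes(EDGES)
print(classes)
```

Each resulting class becomes one node of the index graph, with the class members as its extent.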

13 Language-equivalent
The paths from root to o3: staff, dept/member
Paths to o4 happen to be exactly the same 2 sequences
Same for o8 and o12
o3 ≡ o4 ≡ o8 ≡ o12

14 Equivalence classes
o3 ≡ o4 ≡ o8 ≡ o12
o1 ≡ o2 ≡ o7
o12 ≡ o13
o5 ≡ o6 ≡ o9
o10
o11
[Figure: the example DAG (edge labels: member, dept, support, staff, name, phone) with the equivalence classes marked]

15 The index graph
[Figure: index graph with nodes root, {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}; edge labels: member, support, staff, dept, name, phone]

16 Query processing based on the index graph
[Figure: index graph with nodes root, {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}, {o10}, {o11}; edge labels: member, support, staff, dept, name, phone]
dept/member/(name | phone)
-> dept/member/name UNION dept/member/phone
-> {o5, o6, o9} UNION {o10}
-> {o5, o6, o9, o10}
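The query evaluation above can be sketched as follows. The edge structure in `INDEX` is an assumption reconstructed from the query result on the slide (the figure itself is not recoverable), and the node names I1 to I4 are hypothetical.

```python
# Index graph: index node -> {edge label: [target index nodes]}.
INDEX = {
    "root": {"dept": ["I1"]},
    "I1": {"member": ["I2"]},
    "I2": {"name": ["I3"], "phone": ["I4"]},
}
# Extent: the data nodes summarized by each index node.
EXTENT = {
    "I1": {"o1", "o2", "o7"},
    "I2": {"o3", "o4", "o8", "o12"},
    "I3": {"o5", "o6", "o9"},
    "I4": {"o10"},
}

def eval_path(steps):
    """steps is a list of label sets; a set with several labels is an alternation."""
    frontier = {"root"}
    for labels in steps:
        frontier = {target
                    for node in frontier
                    for label in labels
                    for target in INDEX.get(node, {}).get(label, [])}
    result = set()
    for node in frontier:
        result |= EXTENT.get(node, set())
    return result

answer = eval_path([{"dept"}, {"member"}, {"name", "phone"}])
print(sorted(answer))
```

Only the index graph is traversed; the data graph is never touched, and the answer is the union of the extents reached.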

17 About this indexing scheme
The index graph is never larger than the data
In practice, the index graph is small enough to fit in memory
Constructing the index is, however, a problem: checking whether two nodes are language-equivalent is very expensive (PSPACE)
An approximation based on bisimulation exists

18 A Data Guide
[Figure: a Data Guide for the example data; edge labels: dept, support, staff, member, name, phone; extents shown: {o1, o2, o7}, {o3, o4, o8, o12}, {o12, o13}, {o5, o6, o9}, {o10}, {o11}]

19 About Data Guide
Unique labels at each node
(Hence) extents are no longer disjoint
Query processing proceeds as before
Size of the index may be >= data size
Good for data that is regular & has no cycles
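One common way to obtain unique outgoing labels is a subset construction, sketched here under that assumption. The graph `EDGES` is hypothetical and the function name `build_dataguide` is illustrative.

```python
# Hypothetical data graph: node -> list of (edge label, target node).
EDGES = {
    "root": [("dept", "o1"), ("staff", "o3")],
    "o1": [("member", "o3"), ("member", "o4")],
}

def build_dataguide(edges, root="root"):
    """Each guide node is the set of data nodes reached by one label path."""
    start = frozenset([root])
    guide, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        by_label = {}
        for node in state:
            for label, target in edges.get(node, []):
                by_label.setdefault(label, set()).add(target)
        # Exactly one outgoing edge per label, hence unique labels at each node.
        guide[state] = {label: frozenset(targets) for label, targets in by_label.items()}
        todo.extend(guide[state].values())
    return guide

guide = build_dataguide(EDGES)
staff_extent = guide[frozenset({"root"})]["staff"]
member_extent = guide[frozenset({"o1"})]["member"]
print(sorted(staff_extent), sorted(member_extent))
```

Here o3 appears in both the staff extent and the dept/member extent, which illustrates why extents are no longer disjoint.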

20 XML-Specific Compressors
Unqueriable Compression (e.g. XMill):
Full-chunked: data commonalities eliminated
Very good compression ratio
Queriable Compression (e.g. XGrind, XPRESS):
Fine-grained: data commonalities ignored
Inadequate compression ratio and time
Support simple path queries with atomic predicates

21 XMill
First specialized compressor for XML data
SAX parser for parsing XML data
Still using gzip as its underlying compressor
Clever grouping of data into containers for compression
Compress XML via three basic techniques:
Compress the structure separately from the data
Group the data values according to their types
Apply semantic (specialized) compressors
Downloadable:

22 XMill Architecture:

23 An Example: Web Server Logs
ASCII File 15.9 MB (gzipped 1.6MB):
|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|
XML-ized apache web log inflates to 24.2 MB (gzipped 2.1MB):
<apache:entry>
  <apache:host> </apache:host>
  <apache:requestLine> GET / HTTP/1.0 </apache:requestLine>
  <apache:contentType> text/html </apache:contentType>
  <apache:statusCode> 200</apache:statusCode>
  <apache:date> 1997/10/01-00:00:02</apache:date>
  <apache:byteCount> 4478</apache:byteCount>
  <apache:referer> </apache:referer>
  <apache:userAgent> Mozilla/3.1[ja](I)</apache:userAgent>
</apache:entry>

24 How XMill Works: Three Ideas
Compress the structure separately from the data:
Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry>
Data: GET / HTTP/1.0 text/html 200
gzip Structure + gzip Data = 1.75MB

25 How XMill Works: Three Ideas
Group the data values according to their types:
Structure: <apache:entry> . . . </apache:entry>
Example values: GET / HTTP/1.0, GET / HTTP/1.1
gzip Structure + gzip Data1 + gzip Data2 = 1.33MB
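A minimal sketch of the first two ideas: the structure is separated from the text values, and values are grouped into one container per tag. This is not XMill's actual format. The token syntax, the `split` helper and the sample document are assumptions, `zlib` (DEFLATE, the algorithm behind gzip) stands in for gzip, and mixed content is ignored.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict

XML = ("<log>"
       "<entry><requestLine>GET / HTTP/1.0</requestLine><statusCode>200</statusCode></entry>"
       "<entry><requestLine>GET / HTTP/1.1</requestLine><statusCode>404</statusCode></entry>"
       "</log>")

def split(root):
    """Separate structure tokens from text values, grouping values by tag."""
    structure, containers = [], defaultdict(list)

    def visit(elem):
        structure.append("<" + elem.tag)
        text = (elem.text or "").strip()
        if text:
            structure.append("#")      # placeholder: next value of this tag's container
            containers[elem.tag].append(text)
        for child in elem:
            visit(child)
        structure.append("/")          # end tag

    visit(root)
    return structure, dict(containers)

structure, containers = split(ET.fromstring(XML))
# Each stream is compressed on its own, so similar values sit next to each other.
compressed = {"structure": zlib.compress(" ".join(structure).encode())}
for tag, values in containers.items():
    compressed[tag] = zlib.compress("\n".join(values).encode())
print({name: len(blob) for name, blob in compressed.items()})
```

On an input this small the compressed sizes are not meaningful; the benefit reported on the slides comes from large, repetitive documents.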

26 How XMill Works: Three Ideas
Apply semantic (specialized) compressors:
gzip Structure + gzip c1(Data1) + gzip c2(Data2) + ... = 0.82MB
Examples:
8, 16, 32-bit integer encoding (signed/unsigned)
differential compressing (e.g. 1999, 1995, 2001, 2000, 1995, ...)
compress lists, records (e.g  4 bytes)
Need user input to select the semantic compressor
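A minimal sketch of one semantic compressor, differential encoding followed by small fixed-width integers, using the year sequence from the slide. The function names and the byte layout (one unsigned 16-bit value, then signed 8-bit deltas) are assumptions, and the input is assumed non-empty with deltas that fit in one byte.

```python
import struct

def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

years = [1999, 1995, 2001, 2000, 1995]
deltas = delta_encode(years)
# Pack as one unsigned 16-bit integer followed by signed 8-bit deltas (little-endian, no padding).
packed = struct.pack("<H%db" % (len(deltas) - 1), *deltas)
print(deltas, len(packed))
```

The packed form takes 6 bytes, against 24 characters for the space-separated text, before any general-purpose compressor is applied.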

27 Experiments

28 XML Compression

29 Compression Time

30 Transfer Time (& Decode)

31 XGRIND (Tolani & Haritsa, 2002)
Encodes elements and attributes using XMill’s approach
DTD-conscious: enumerated attributes with k possible values are encoded using a log2 k-bit scheme
Data values are encoded using non-adaptive Huffman coding
Requires two passes over the input document
Separate statistical model for each element/attribute
Homomorphic compression: compressed document retains original structure
June 24, 2008 XML Compression Techniques 31
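A minimal sketch of the two value encodings named above: a fixed-width index for enumerated attributes, and a static Huffman table built in a first pass and applied in a second. This is not XGrind's actual code; `enum_bits`, `huffman_code` and the sample values are assumptions.

```python
import heapq
from collections import Counter

def enum_bits(k):
    """Bits needed to index one of k enumerated values, i.e. ceil(log2 k)."""
    return (k - 1).bit_length()

def huffman_code(freq):
    """Build a prefix-free code (symbol -> bit string) from a symbol -> count table."""
    if len(freq) == 1:
        return {sym: "0" for sym in freq}
    # The integer in the middle breaks ties so the dicts are never compared.
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

values = ["78", "86", "91", "87"]
table = huffman_code(Counter("".join(values)))                  # pass 1: statistics
encoded = ["".join(table[ch] for ch in v) for v in values]      # pass 2: encoding
print(enum_bits(5), table, encoded)
```

Because the table is fixed before encoding starts, the same character always maps to the same bits, which is what makes queries over the compressed values possible.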

32 XGRIND
Original Fragment:
<student name="Alice">
  <a1>78</a1>
  <a2>86</a2>
  <midterm>91</midterm>
  <project>87</project>
</student>
Compressed Fragment:
T0 A0 nahuff(Alice) T1 nahuff(78) / T2 nahuff(86) / T3 nahuff(91) / T4 nahuff(87) / /

33 XGRIND
Many queries can be carried out entirely in compressed domain: exact-match, prefix-match
Some others require only decompression of relevant values: range, substring
Queryability comes at the expense of achievable compression ratio: typically within 65-75% of that of XMill
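A minimal sketch of why exact-match and prefix-match work in the compressed domain when the code is static and prefix-free: the query is encoded with the same table and compared bit for bit. The table `CODE` is hypothetical (it stands in for a non-adaptive Huffman table), and bit strings are kept as text for readability.

```python
# Hypothetical static, prefix-free code table.
CODE = {"A": "00", "l": "01", "i": "100", "c": "101",
        "e": "110", "B": "1110", "o": "11110", "b": "11111"}

def encode(text):
    return "".join(CODE[ch] for ch in text)

names = ["Alice", "Bob", "Alec"]
compressed = [encode(name) for name in names]

def exact_match(query):
    """Indexes of values equal to query, found without decompressing anything."""
    q = encode(query)
    return [i for i, c in enumerate(compressed) if c == q]

def prefix_match(prefix):
    """Indexes of values starting with prefix: their codes start with the prefix's code."""
    q = encode(prefix)
    return [i for i, c in enumerate(compressed) if c.startswith(q)]

print(exact_match("Bob"), prefix_match("Al"), prefix_match("Ali"))
```

Range and substring predicates do not map to such bit comparisons, which is why they need the relevant values to be decompressed first.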

34 ISX Requirements
Space does matter for many applications
Generally reducing space improves cache locality
Indirection is expensive
Support fast navigations
Support fast insertion and deletion
Support efficient joins
Separate topology, text and schema

35 ISX Goal
To find a space-efficient storage scheme for XML data without compromising either query or update performance

36 Proposed Storage Structure
The ISX Structure

37 Sample DBLP XML Fragment

38 Balanced Parenthesis Encoding

39 Node Navigations
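A minimal sketch of a balanced parenthesis encoding and of navigation over it, assuming a tree given as nested (label, children) tuples. The sample tree and the function names are hypothetical; labels and text are stored elsewhere, so only the topology is encoded, and `find_close` is a plain linear scan.

```python
# Hypothetical tree: (label, [children]).
TREE = ("dblp", [("article", [("author", []), ("title", [])]),
                 ("book", [("title", [])])])

def encode(tree):
    """One '(' when a node opens and one ')' when it closes, in document order."""
    _label, children = tree
    return "(" + "".join(encode(child) for child in children) + ")"

def find_close(bp, i):
    """Index of the ')' matching the '(' at position i, found by tracking the excess."""
    excess = 0
    for j in range(i, len(bp)):
        excess += 1 if bp[j] == "(" else -1
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")

def first_child(bp, i):
    return i + 1 if bp[i + 1] == "(" else None

def next_sibling(bp, i):
    j = find_close(bp, i) + 1
    return j if j < len(bp) and bp[j] == "(" else None

bp = encode(TREE)
print(bp, first_child(bp, 0), next_sibling(bp, 1))
```

A node is identified by the position of its open parenthesis, so the whole topology costs two bits per node.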

40 Topology Tiers
No. of )
No. of (
No. of text nodes
Min, max of forward excess
Min, max of backward excess

41 Primitive operators

42 Topology Tiers
No. of )
No. of (
No. of text nodes
Min, max of forward excess
Min, max of backward excess
Example: excess 2. Where is the close tag?
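A simplified single-tier sketch of how per-block excess summaries answer "where is the close tag?" without scanning every parenthesis. It keeps only the net excess (number of "(" minus number of ")") and the minimum forward excess per block; the block size, names and layout are assumptions, not the ISX structure itself.

```python
def build_tier(bp, block=4):
    """Per block: (net excess, minimum prefix excess), both relative to the block start."""
    tier = []
    for start in range(0, len(bp), block):
        excess, low = 0, 0
        for ch in bp[start:start + block]:
            excess += 1 if ch == "(" else -1
            low = min(low, excess)
        tier.append((excess, low))
    return tier

def find_close(bp, tier, i, block=4):
    """Index of the ')' matching the '(' at position i."""
    excess = 0
    k = i // block
    # Scan the remainder of the block that contains i.
    for j in range(i, min(len(bp), (k + 1) * block)):
        excess += 1 if bp[j] == "(" else -1
        if excess == 0:
            return j
    # Skip whole blocks whose minimum excess cannot bring the running excess to 0.
    for k in range(k + 1, len(tier)):
        net, low = tier[k]
        if excess + low <= 0:
            for j in range(k * block, min(len(bp), (k + 1) * block)):
                excess += 1 if bp[j] == "(" else -1
                if excess == 0:
                    return j
        else:
            excess += net
    raise ValueError("unbalanced sequence")

bp = "((()())(()))"
tier = build_tier(bp)
print(tier, find_close(bp, tier, 0), find_close(bp, tier, 1))
```

With further tiers built over these summaries the same test can skip many blocks at once, which is the role the tiers play on the slide.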

43 Tier 2 excess

44 Efficient Updates

45 Example
100 MB DBLP document
5 million XML nodes
ISX: 1MB topology

46 Another example
Setup: Core Duo 1.83GHz, 1GB RAM, 5400 RPM hard drive, MS Vista
100M DBLP            MSXML     ISX
Runtime (loading)    329MB     67MB
Loading time         17.8s     0.67s
Runtime (//www)      333MB
//www                1.814s    0.143s
5M DBLP              MSXML     ISX
Runtime (loading)    15MB      4MB
Loading time         0.54s     0.035s
Runtime (//www)      21MB
//www                0.096s    0.004s

47 ISX Features

48 Experiments Setup
Fixed at 64MB memory buffer
Up to 16 GB XML document
E.g. 16 GB DBLP contains > 770 million nodes
NO index or query optimization has been employed for ISX (except for ISX Stream, where the TurboXPath algorithm has been employed)

49 Storage Size (ISX vs NoK)

50 Storage Size (ISX, XMill, XGrind): DBLP

51 Storage Size (ISX, XMill): TreeBank

52 Bulk Loading Performance

53 Queries

54 Q1: //inproceedings

55 Q5: //article[.//month/text() = “July”]//title

56 Other queries

57 XPath 13 axes
We can navigate along 13 axes:
ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

58 Node Navigation

59 Full document traversal

60 Update (Insertion) Performance

61 ISX Summary
Small storage footprint
Small runtime footprint
Fast and consistent performance on navigational access
Superior query performance (further indexing / query optimization can be added)
Superior update performance

62 Compressing and Searching XML Data Via Two Zips
Paolo Ferragina et al. Slides modified from P. Ferragina’s

63 An XML excerpt
It is verbose!
...
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>

64 A tree interpretation...
XML document exploration ≡ Tree navigation
XML document search ≡ Labeled subpath searches
Subset of XPath [W3C]

65 The Problem
We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:
Navigational operations: parent(u), child(u, i), child(u, i, c)
Subpath searches: given a sequence P of k labels
Content searches: subpath + substring search
Visualization operation: given a node, visualize its descending subtree
XML-aware compressors (like XMill, XmlPpm, ScmPpm,...): need the whole decompression
XML-queriable compressors (like XPress, XGrind, XQzip,...): poor compression and scan of the whole (compressed) file
Summary indexes (like Dataguide, 1-index or 2-index): large space and do not support “content” searches
Theoretically, many solutions exist, starting from [Jacobson, IEEE Focs ’89]: no subpath/content searches, and poor performance on labeled trees
XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage

66 A transform for “labeled trees” [Ferragina et al, IEEE Focs ’05]
We proposed the XBW-transform that mimics on trees the nice structural properties of the Burrows-Wheeler Transform on strings
The XBW linearizes the tree T in 2 arrays s.t.:
the compression of T reduces to using any compressor (gzip, bzip,...) over these two arrays
the indexing of T reduces to implementing simple rank/select query operations over these two arrays

67 The XBW-Transform
Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path
[Figure: example tree and the resulting arrays Sa (a permutation of the tree nodes) and Sp (upward labeled paths)]

68 The XBW-Transform
Step 2. Stably sort according to Sp
[Figure: the rows of Sa and Sp (upward labeled paths) after sorting]

69 The XBW-Transform
Step 3. Add a binary array Slast marking the rows corresponding to last children
Key fact: Nodes correspond to items in <Slast,Sa>
[Figure: the XBW of the example tree, with columns Slast, Sa and Sp]
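The three steps can be sketched on a small tree. The tree `TREE` is hypothetical (it is not the tree in the lecture's figure), lowercase labels mark leaves, and treating the root as a last child is an assumption of this sketch.

```python
# Hypothetical tree: (label, [children]); lowercase labels are leaves.
TREE = ("C", [("B", [("D", [("a", [])]), ("c", [])]),
              ("A", [("b", [])]),
              ("B", [("D", [("c", [])])])])

def xbw(tree):
    """Return rows (last, label, upward_path), stably sorted by upward path."""
    rows = []

    def visit(node, path, is_last):
        label, children = node
        rows.append((int(is_last), label, path))          # Step 1: pre-order visit
        for idx, child in enumerate(children):
            # The child's upward path starts with its parent's label.
            visit(child, label + path, idx == len(children) - 1)

    visit(tree, "", True)
    rows.sort(key=lambda row: row[2])                      # Step 2: stable sort on Sp
    return rows

rows = xbw(TREE)
s_last = [row[0] for row in rows]                          # Step 3: last-child flags
s_a = "".join(row[1] for row in rows)
s_p = [row[2] for row in rows]
for row in rows:
    print(row)
```

After sorting, the children of each node sit in one contiguous run that ends with a 1 in Slast, so Sp can be discarded and only Slast and Sa need to be stored.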

70 XBzip – a simple XML compressor
XBW is compressible:
Sa and Spcdata are locally homogeneous
Slast has some structure
[Figure labels: Tags, Attributes and symbol =; Pcdata]

71 Some structural properties
Two useful properties:
Children are contiguous and delimited by 1s
Children reflect the order of their parents
[Figure: the XBW of the example tree, with columns Slast, Sa and Sp]

72 XBW is navigational
XBW is navigational:
Rank-Select data structures on Slast and Sa
The array C of |Σ| integers
[Figure: Get_children example. Array C: A 2, B 5, C 9, D 12. Rank(B,Sa)=2; select in Slast the 2nd item 1 “from here...”]
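A minimal sketch of Get_children with rank and select, assuming the XBW arrays of a small hypothetical tree C(B(D(a), c), A(b), B(D(c))) rather than the lecture's figure. `FIRST` plays the role of the array C (first row whose Sp starts with each label), positions are 0-based, and rank/select are naive linear scans standing in for succinct structures.

```python
S_LAST = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1]
S_A = "CbDcDBABac"
FIRST = {"A": 1, "B": 2, "C": 5, "D": 8}   # only internal labels have an entry

def rank(seq, symbol, i):
    """Number of occurrences of symbol in seq[0..i], inclusive."""
    return sum(1 for x in seq[:i + 1] if x == symbol)

def select(seq, symbol, k):
    """Position of the k-th occurrence of symbol (k >= 1); -1 when k == 0."""
    if k == 0:
        return -1
    seen = 0
    for idx, x in enumerate(seq):
        if x == symbol:
            seen += 1
            if seen == k:
                return idx
    raise ValueError("not enough occurrences")

def get_children(i):
    """Row range (first, last) holding the children of row i, or None for a leaf."""
    label = S_A[i]
    if label not in FIRST:
        return None
    r = rank(S_A, label, i)                    # row i is the r-th node with this label
    z = rank(S_LAST, 1, FIRST[label] - 1)      # child groups that end before the label's block
    first = select(S_LAST, 1, z + r - 1) + 1   # just after the previous group's closing 1
    last = select(S_LAST, 1, z + r)            # the closing 1 of the r-th group
    return first, last

print(get_children(0), get_children(5), get_children(7))
```

The r-th node labeled B owns the r-th group of children inside the block of rows whose Sp starts with B, which is why a rank on Sa followed by selects on Slast is enough.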

73 XBW is searchable (count subpaths)
XBW is searchable:
Rank-Select data structures on Slast and Sa
Array C of |Σ| integers
Example: P = B D
fr, lr: rows whose Sp starts with ‘B’
Inductive step: pick the next char in P[i+1], i.e. ‘D’; search for the first and last ‘D’ in Sa[fr,lr]; jump to their children
Their children have upward path = ‘D B’
2 occurrences of P because of two 1s
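The counting procedure can be sketched on the XBW arrays of a small hypothetical tree C(B(D(a), c), A(b), B(D(c))), not the lecture's figure. `FIRST` plays the role of the array C, positions are 0-based, scans are naive, and the sketch assumes every label in P is an internal label.

```python
S_LAST = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1]
S_A = "CbDcDBABac"
FIRST = {"A": 1, "B": 2, "C": 5, "D": 8}
LABELS = sorted(FIRST)

def block(label):
    """Rows whose upward path Sp starts with label."""
    pos = LABELS.index(label)
    end = FIRST[LABELS[pos + 1]] - 1 if pos + 1 < len(LABELS) else len(S_A) - 1
    return FIRST[label], end

def select1(k):
    """Position of the k-th 1 in S_LAST (k >= 1); -1 when k == 0."""
    ones = [i for i, bit in enumerate(S_LAST) if bit == 1]
    return ones[k - 1] if k > 0 else -1

def get_children(i):
    label = S_A[i]
    r = S_A[:i + 1].count(label)               # rank of this node among equal labels
    z = sum(S_LAST[:FIRST[label]])             # 1s before the label's block
    return select1(z + r - 1) + 1, select1(z + r)

def count_subpath(path):
    """Number of occurrences of the downward label path, e.g. "BD" for B/D."""
    if path[0] not in FIRST:
        return 0
    fr, lr = block(path[0])
    for label in path[1:]:
        hits = [i for i in range(fr, lr + 1) if S_A[i] == label]
        if not hits or label not in FIRST:
            return 0
        fr = get_children(hits[0])[0]          # jump to the first hit's children
        lr = get_children(hits[-1])[1]         # ... and to the last hit's children
    return sum(S_LAST[fr:lr + 1])              # one 1 per matching node

print(count_subpath("BD"), count_subpath("CBD"), count_subpath("AD"))
```

Each step narrows [fr, lr] to the rows whose Sp starts with the reversed prefix of P read so far, so the work depends on the length of P rather than on the size of the tree once rank and select are constant-time.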

