Compressing XML Documents with Finite State Automata


1 Compressing XML Documents with Finite State Automata
S. Hariharan, Priti Shankar

2 Organization
An introduction to XAUST
A brief introduction to XML & DTD
Previous work: XMill
Utility of DTD
Implementation
Experimental results
Conclusion

3 XAUST: Introduction
XAUST: XML Compression with AUtomata and STack.
It is the only tool that uses a DTD to compress documents automatically.
It generates code for an arbitrary DTD and compresses documents conforming to that DTD.

4 XML
XML describes a class of data objects called XML documents.
The presence of tags makes the document verbose.
Attributes are treated as tags.
An XML document is characterized as a tree.

<root>
  <student id="1">
    <name> Hariharan </name>
  </student>
</root>

In this fragment, <root> is an open tag, id="1" is an attribute, Hariharan is content, and </root> is a close tag.
[Figure: the document as a tree, with root at the top, student below it, and id (1) and name (Hariharan) below student]

5 DTD
A DTD describes a document.
The DTD for the above XML fragment is given below.

<!DOCTYPE StudentInfo [
  <!ELEMENT root (student*)>
  <!ELEMENT student (name | rollNo, comment?)>
  <!ATTLIST student id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT rollNo (#PCDATA)>
  <!ELEMENT comment (#PCDATA)>
]>

*        means zero or more occurrences
+        means one or more occurrences
?        means zero or one occurrence
|        means either-or
,        means and
#PCDATA  means textual content

6 Arithmetic coding
Arithmetic coding replaces a string of symbols with a single floating-point number.

Symbol  Probability  Range
a       0.2          [0, 0.2)
b       0.3          [0.2, 0.5)
c       0.1          [0.5, 0.6)
d       0.2          [0.6, 0.8)
e       0.1          [0.8, 0.9)
!       0.1          [0.9, 1.0)

We wish to encode bacc!
After seeing b, the encoder narrows the interval down to [0.2, 0.5).
After seeing a, the interval is further narrowed down to 1/5th of itself, so the new interval becomes [0.2, 0.26).
After seeing the first c, the interval becomes [0.23, 0.236).
The final interval is [0.23354, 0.2336).
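The interval narrowing on this slide can be checked with a minimal sketch. It assumes the fixed symbol ranges from the table and uses exact fractions to avoid rounding noise; the names RANGES and encode_interval are illustrative and not part of XAUST, which (per a later slide) uses an adaptive order-4 model instead of a fixed table.

```python
from fractions import Fraction as F

# Cumulative ranges from the slide's table: symbol -> [low, high)
RANGES = {
    "a": (F("0"), F("0.2")),
    "b": (F("0.2"), F("0.5")),
    "c": (F("0.5"), F("0.6")),
    "d": (F("0.6"), F("0.8")),
    "e": (F("0.8"), F("0.9")),
    "!": (F("0.9"), F("1")),
}


def encode_interval(message):
    """Return the interval [low, high) reached after each symbol."""
    low, high = F(0), F(1)
    steps = []
    for sym in message:
        width = high - low
        s_low, s_high = RANGES[sym]
        # Keep only the slice of the current interval that belongs to sym.
        low, high = low + width * s_low, low + width * s_high
        steps.append((sym, low, high))
    return steps


for sym, low, high in encode_interval("bacc!"):
    print(sym, float(low), float(high))
```

Any number inside the last interval identifies the whole message, given the same table and the terminating symbol !.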

7 Previous work: XMill
XMill separates structure from text.
XMill has a structure container and text containers.
Consider the XML fragment

<book> <title> xyz </title> <isbn> 123 </isbn> </book>

book = #1, title = #2, isbn = #3
Structure = #1 #2 C1 / #3 C2 / /

Structure sequences like the one above repeat many times in a document, so they compress well.
The text is mapped to a container depending on the path from the root.
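The separation shown on this slide can be sketched as follows. This is not XMill's code: it is a simplified illustration that assumes tags are numbered in order of first appearance, that each distinct root-to-element path gets its own container, and that text following a child element (tail text) is ignored. The function name separate is hypothetical.

```python
import xml.etree.ElementTree as ET


def separate(xml_text):
    """Split a document into a structure string and per-path text containers."""
    tag_ids = {}        # tag name -> dictionary number (#1, #2, ...)
    container_ids = {}  # root-to-element path -> container number
    containers = {}     # container number -> list of text values
    structure = []

    def walk(elem, path):
        path = path + (elem.tag,)
        structure.append("#%d" % tag_ids.setdefault(elem.tag, len(tag_ids) + 1))
        text = (elem.text or "").strip()
        if text:
            # The container is chosen by the path from the root.
            cid = container_ids.setdefault(path, len(container_ids) + 1)
            containers.setdefault(cid, []).append(text)
            structure.append("C%d" % cid)
        for child in elem:
            walk(child, path)
        structure.append("/")  # close tag

    walk(ET.fromstring(xml_text), ())
    return " ".join(structure), containers


structure, containers = separate(
    "<book> <title> xyz </title> <isbn> 123 </isbn> </book>"
)
print(structure)
print(containers)
```

Each stream can then be handed to a general-purpose compressor separately, which is where grouping similar values pays off.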

8 Motivation for XAUST
XMill does not use the DTD.
We propose a scheme that uses the DTD to achieve a better compression ratio.
Automatic multiplexing is achieved.

9 Utility of DTD
Consider the elements addressBook and card.

<!ELEMENT addressBook ( card* )>
<!ELEMENT card ( (name | (givenName, familyName)), , note? )>

[Figure: a sample fragment of the XML document, with nodes addressBook, card, name, givenName, familyName and note]

Tags that the DTD makes predictable need not be encoded.

10 Regular Expression
<!ELEMENT card ( (name | (givenName, familyName)), , note? )>
The content model of card is a regular expression. The DFA for the corresponding regular expression is given below.
[Figure: DFA with states 1 to 5 and transitions labelled name, givenName, familyName and note]

11 Regular Expression contd.
[Figure: the same DFA, with states 1 to 5 and transitions labelled name, givenName, familyName and note]
We can see that we need not encode familyName and tags: they leave states with a single outgoing transition.
But state 4 is a state with multiple transitions. There is an implicit transition to the parent element addressBook (<!ELEMENT addressBook (card*)>).
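The idea of skipping tags at single-transition states can be illustrated with a small sketch. It rests on several assumptions: the content model is simplified to (name | (givenName, familyName)), note?, leaving out the item that is missing from the declaration above; the state numbers are chosen here and do not match the figure; and the end of the card element is modelled as one more choice at accepting states, standing in for the implicit transition to the parent. All names are hypothetical.

```python
# Hypothetical DFA for the simplified content model
#   (name | (givenName, familyName)), note?
DFA = {
    1: {"name": 3, "givenName": 2},
    2: {"familyName": 3},
    3: {"note": 4},
    4: {},
}
ACCEPTING = {3, 4}
END = "</card>"  # stands for the implicit transition back to the parent


def choices(state):
    """All moves possible in a state, in a fixed order shared by both sides."""
    opts = sorted(DFA[state])
    if state in ACCEPTING:
        opts.append(END)
    return opts


def encode_children(tags):
    """Emit a choice index only where the DFA offers more than one move."""
    state, out = 1, []
    for tag in list(tags) + [END]:
        opts = choices(state)
        if tag not in opts:
            raise ValueError("document does not conform to the content model")
        if len(opts) > 1:
            out.append(opts.index(tag))
        if tag != END:
            state = DFA[state][tag]
    return out


def decode_children(codes):
    """Rebuild the tag sequence from the choice indices alone."""
    codes = iter(codes)
    state, tags = 1, []
    while True:
        opts = choices(state)
        tag = opts[0] if len(opts) == 1 else opts[next(codes)]
        if tag == END:
            return tags
        tags.append(tag)
        state = DFA[state][tag]


codes = encode_children(["givenName", "familyName", "note"])
print(codes, decode_children(codes))
```

In this sketch familyName never produces output, because the state reached after givenName has exactly one outgoing transition; only the genuine choices cost anything.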

12 Encoding text
Consider a fragment of DTD:

<!ELEMENT A (C, B, D) >
<!ELEMENT B (C) >
<!ELEMENT C (#PCDATA) >
<!ELEMENT D (#PCDATA) >

[Figure: the tree for the DTD, with nodes A, B, D and #PCDATA]

3 choices for encoding text:
A single container for all text
A single container for each element
A single container for each leaf node of the DTD tree
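To make the three choices concrete, the following sketch lists the containers each one would create for the DTD fragment above. It assumes a non-recursive DTD and identifies a leaf of the DTD tree by its path from the root; the dictionary encoding of the DTD and the function names are illustrative only.

```python
# Content models of the DTD fragment; None marks a #PCDATA element.
DTD = {"A": ["C", "B", "D"], "B": ["C"], "C": None, "D": None}


def leaf_paths(element, path=()):
    """Paths from the root to every #PCDATA leaf of the DTD tree."""
    path = path + (element,)
    children = DTD[element]
    if children is None:
        return ["/".join(path)]
    return [p for child in children for p in leaf_paths(child, path)]


def container_keys(option):
    if option == 1:  # a single container for all text
        return ["all-text"]
    if option == 2:  # a single container for each text-bearing element
        return sorted(name for name, model in DTD.items() if model is None)
    if option == 3:  # a single container for each leaf of the DTD tree
        return leaf_paths("A")
    raise ValueError("option must be 1, 2 or 3")


for option in (1, 2, 3):
    print(option, container_keys(option))
```

Here C appears under both A and B, so option 3 keeps two containers for it where option 2 keeps one.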

13 Encoding text contd.
Advantages and disadvantages of the 3 approaches

A single container for all the text
  Advantage: Low memory consumption
  Disadvantage: Low compression ratio
A single container for each element
  Advantage: Medium memory consumption
A single container for each leaf node in the DTD tree
  Disadvantage: High memory consumption

14 Encoding text contd.
There is not much difference in the compression ratio between options 2 and 3.
Option 2, i.e., ‘A single container for each element’, was chosen as it entails less memory consumption.
As text is character data, we used arithmetic compression.
Order-4 adaptive arithmetic compression is used.

15 Implementation
An automaton is generated for each element.
A stack is used for storing the current state of the ancestor elements.

When encountering   Action taken
Open tag            Push the current state onto the stack. Start the automaton of the child element.
Close tag           Pop the current state from the stack. Make the state transition.
Text                Encode the text.
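The three actions in the table can be sketched as an event loop. This is an illustration under stated assumptions, not the generated XAUST code: the DTD is a hypothetical simplification (addressBook (card*), card (name, note?), with name and note holding text), every automaton starts in state 0, text goes to one container per element, and the actual coding of tags and text is left out.

```python
# Hypothetical per-element automata: (state, child tag) -> next state.
AUTOMATA = {
    "addressBook": {(0, "card"): 0},
    "card": {(0, "name"): 1, (1, "note"): 2},
    "name": {},
    "note": {},
}


def process(events):
    """Run the stack of automata over (kind, value) parser events."""
    stack = []                 # saved (element, state) pairs of the ancestors
    element, state = None, 0   # automaton currently running
    containers = {}            # element name -> text values seen so far
    for kind, value in events:
        if kind == "open":
            # Push the current state, then start the child's automaton.
            stack.append((element, state))
            element, state = value, 0
        elif kind == "close":
            # Pop the parent's state and make its transition on the child tag.
            child = element
            element, state = stack.pop()
            if element is not None:
                state = AUTOMATA[element][(state, child)]
        else:
            # Text: route it to the container of the enclosing element.
            containers.setdefault(element, []).append(value)
    return containers


events = [
    ("open", "addressBook"),
    ("open", "card"),
    ("open", "name"), ("text", "Hariharan"), ("close", "name"),
    ("open", "note"), ("text", "author"), ("close", "note"),
    ("close", "card"),
    ("close", "addressBook"),
]
print(process(events))
```

The stack depth is bounded by the nesting depth of the document, which is one reason this design can stream through large inputs; a missing transition raises KeyError here, signalling a document that does not conform to the assumed DTD.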

16 Experiments
We conducted our experiments on XMark, DBLP, Uniprot, Michigan and XOO7 documents.
XAUST is compared with XMill, XMLPPM and gzip.
XMLPPM ran out of memory for the Uniprot and Michigan documents.
XMill, XMLPPM and gzip are each better than XAUST for one document.
Compression ratio is defined as the ratio between the size of the compressed document and the size of the original document.

17 Document size

Name      Size (in MB)
Auction   113
Dblp      253
Uniprot   1070
Michigan  495
XOO7      128

18 Compression ratio of document

19 Compression ratio of tags

20 Conclusion and Future Work
The results amply justify the utility of the DTD.
Large documents are compressed without running out of memory.
The only restriction placed by XAUST is the presence of a DTD.
Future work will concentrate on querying the compressed document.

21 References
Hartmut Liefke, Dan Suciu: XMill: An efficient compressor for XML data. Proceedings of ACM SIGMOD, 2000.
Nelson, M.: Arithmetic Coding. Dr. Dobb's Journal.
Data sets: UniProt, Michigan, XOO7, XMark, DBLP.

22 Thank you

