Compressing XML Documents with Finite State Automata


Compressing XML Documents with Finite State Automata
S. Hariharan, Priti Shankar

Organization
- An introduction to XAUST
- A brief introduction to XML and DTD
- Previous work: XMill
- Utility of DTD
- Implementation
- Experimental results
- Conclusion

XAUST: Introduction
XAUST stands for XML compression with AUtomata and STack. It is the only tool that uses the DTD to compress documents automatically. It generates code for an arbitrary DTD and compresses documents conforming to that DTD.

XML
XML describes a class of data objects called XML documents. The presence of tags makes the document verbose. Attributes are treated as tags. An XML document is characterized as a tree.
  <root> <student id="1"> <name> Hariharan </name> </student> </root>
In this fragment <student> is an open tag, id is an attribute, Hariharan is content and </student> is a close tag.
[Figure: tree representation of the fragment, with nodes root, student, id (1) and name (Hariharan)]

DTD
A DTD describes the structure of a document. The DTD for the above XML fragment is given below.
  <!DOCTYPE StudentInfo [
  <!ELEMENT root (student*)>
  <!ELEMENT student (name | rollNo, comment?)>
  <!ATTLIST student id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT rollNo (#PCDATA)>
  <!ELEMENT comment (#PCDATA)>
  ]>
- * means zero or more occurrences
- + means one or more occurrences
- ? means zero or one occurrence
- | means either-or
- , means and (sequence)
- #PCDATA means textual content

Arithmetic coding
Arithmetic coding replaces a string of symbols with a single floating-point number.
  Symbol  Probability  Range
  a       0.2          [0, 0.2)
  b       0.3          [0.2, 0.5)
  c       0.1          [0.5, 0.6)
  d       0.2          [0.6, 0.8)
  e       0.1          [0.8, 0.9)
  !       0.1          [0.9, 1.0)
We wish to encode the string bacc!
After seeing b, the encoder narrows the interval down to [0.2, 0.5). After seeing a, the interval is further narrowed down to the first 1/5th of itself, so the new interval becomes [0.2, 0.26). After seeing the first c, the interval becomes [0.23, 0.236), and after the second c it becomes [0.233, 0.2336). The final interval, after !, is [0.23354, 0.2336).
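
The interval narrowing on this slide can be reproduced with a few lines of Python. This is only a sketch of the encoder's interval arithmetic, not a full arithmetic coder: the function name `narrow` is made up for illustration, the probabilities for d, e and ! are taken from the widths of their ranges, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Symbol ranges from the slide's table, stored as exact fractions.
RANGES = {
    "a": (Fraction(0), Fraction(2, 10)),
    "b": (Fraction(2, 10), Fraction(5, 10)),
    "c": (Fraction(5, 10), Fraction(6, 10)),
    "d": (Fraction(6, 10), Fraction(8, 10)),
    "e": (Fraction(8, 10), Fraction(9, 10)),
    "!": (Fraction(9, 10), Fraction(1)),
}


def narrow(message, ranges=RANGES):
    """Return the interval [low, high) after each symbol of message."""
    low, high = Fraction(0), Fraction(1)
    steps = []
    for symbol in message:
        width = high - low
        sym_low, sym_high = ranges[symbol]
        # Keep only the sub-interval that belongs to this symbol.
        low, high = low + width * sym_low, low + width * sym_high
        steps.append((symbol, low, high))
    return steps


if __name__ == "__main__":
    for symbol, low, high in narrow("bacc!"):
        print(symbol, float(low), float(high))
```

Any number inside the final interval identifies the message, provided the decoder knows the same table and stops at the terminator symbol !.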

Previous work: XMill
XMill separates structure from text: it keeps one structure container and several text containers. Consider the XML fragment
  <book> <title> xyz </title> <isbn> 123 </isbn> </book>
With the tag dictionary book = #1, title = #2, isbn = #3, the structure is encoded as
  #1 #2 C1 / #3 C2 / /
where C1 and C2 refer to text containers and / marks a close tag. Structure sequences like the one above repeat many times in a document and therefore compress well. Each text item is mapped to a container depending on its path from the root.
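
The separation described on this slide can be sketched in Python as follows. This is an illustration of the idea only, not XMill's actual code or file format: the function name `separate` is hypothetical, containers are keyed by the full path from the root, and text that follows a child element (mixed content) is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def separate(xml_text):
    """Split a document into a tag dictionary, a structure string and text containers."""
    root = ET.fromstring(xml_text)
    tag_ids = {}      # tag name -> number used in the structure string
    containers = {}   # path from root -> (container number, list of text values)
    structure = []

    def visit(elem, parent_path):
        path = parent_path + "/" + elem.tag
        tag_id = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
        structure.append(f"#{tag_id}")
        text = (elem.text or "").strip()
        if text:
            if path not in containers:
                containers[path] = (len(containers) + 1, [])
            number, values = containers[path]
            values.append(text)
            structure.append(f"C{number}")
        for child in elem:
            visit(child, path)
        structure.append("/")  # close tag

    visit(root, "")
    texts = {path: values for path, (_, values) in containers.items()}
    return tag_ids, " ".join(structure), texts


if __name__ == "__main__":
    fragment = "<book> <title> xyz </title> <isbn> 123 </isbn> </book>"
    tag_ids, structure, texts = separate(fragment)
    print(tag_ids)
    print(structure)
    print(texts)
```

Each container can then be handed to a general-purpose compressor separately, so similar values (all titles, all ISBNs) end up next to each other.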

Motivation for XAUST
XMill does not use the DTD. We propose a scheme that uses the DTD to achieve a better compression ratio. Automatic multiplexing is achieved.

Utility of DTD
Consider the elements addressBook and card:
  <!ELEMENT addressBook ( card* )>
  <!ELEMENT card ( (name | (givenName, familyName)), email, note? )>
[Figure: a sample fragment of the XML document, drawn as a tree with nodes addressBook, card, name, givenName, familyName, email and note]
Tags that the DTD makes predictable need not be encoded.

Regular Expression
The content model of card is a regular expression:
  <!ELEMENT card ( (name | (givenName, familyName)), email, note? )>
The DFA for the corresponding regular expression is given below.
[Figure: DFA with states 1 to 5 and transitions labeled name, givenName, familyName, email and note]

Regular Expression contd.
[Figure: DFA with states 1 to 5 and transitions labeled name, givenName, familyName, email and note]
We can see that we need not encode the familyName and email tags, because each of them is the only transition out of its state. State 4, however, has multiple transitions: besides note, there is an implicit transition back to the parent element addressBook ( <!ELEMENT addressBook (card*)> ).
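
The rule on this slide (emit a symbol only where the automaton offers a choice) can be sketched as below. The transition table is an assumption: the transcript lists the states and labels of the DFA but not its edges, so the numbering here is one reading that is consistent with the content model of card. The names `CARD_DFA`, `END`, `encode` and `decode` are illustrative.

```python
END = "END"  # implicit transition back to the parent automaton

# Assumed DFA for ((name | (givenName, familyName)), email, note?)
CARD_DFA = {
    1: {"name": 3, "givenName": 2},
    2: {"familyName": 3},
    3: {"email": 4},
    4: {"note": 5, END: None},
    5: {END: None},
}


def encode(tags, dfa=CARD_DFA, start=1):
    """Emit a tag only when the current state has more than one way out."""
    state, out = start, []
    for tag in list(tags) + [END]:
        choices = dfa[state]
        if tag not in choices:
            raise ValueError(f"{tag!r} is not allowed in state {state}")
        if len(choices) > 1:
            out.append(tag)
        state = choices[tag]
    return out


def decode(symbols, dfa=CARD_DFA, start=1):
    """Rebuild the full tag sequence; forced transitions are inferred from the DFA."""
    state, tags, stream = start, [], iter(symbols)
    while state is not None:
        choices = dfa[state]
        tag = next(stream) if len(choices) > 1 else next(iter(choices))
        if tag != END:
            tags.append(tag)
        state = choices[tag]
    return tags


if __name__ == "__main__":
    children = ["givenName", "familyName", "email"]
    encoded = encode(children)
    print(encoded)
    print(decode(encoded))
```

For a card with givenName, familyName and email, only two symbols are emitted: the choice made in state 1 and the decision to leave in state 4.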

Encoding text
Consider a fragment of a DTD:
  <!ELEMENT A (C, B, D) >
  <!ELEMENT B (C) >
  <!ELEMENT C (#PCDATA) >
  <!ELEMENT D (#PCDATA) >
[Figure: the tree for the DTD, with nodes A, B, D and #PCDATA leaves]
There are 3 choices for encoding text:
- A single container for all text
- A single container for each element
- A single container for each leaf node of the DTD tree

Encoding text contd.
Advantages and disadvantages of the 3 approaches:
- A single container for all the text. Advantage: low memory consumption. Disadvantage: poor compression ratio.
- A single container for each element. Advantage: medium memory consumption.
- A single container for each leaf node in the DTD tree. Disadvantage: high memory consumption.
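
To make the memory trade-off concrete, the sketch below counts how many containers each option needs for the DTD fragment A (C, B, D), B (C). The path list and the helper names `container_key` and `containers` are illustrative assumptions, not part of XAUST.

```python
# Paths to the text-bearing leaves of the DTD tree for A (C, B, D), B (C).
LEAF_PATHS = ["A/C", "A/B/C", "A/D"]


def container_key(path, option):
    """Pick the container for a piece of text found at the given path."""
    if option == 1:   # a single container for all text
        return "all-text"
    if option == 2:   # a single container for each element
        return path.split("/")[-1]
    if option == 3:   # a single container for each leaf node of the DTD tree
        return path
    raise ValueError(f"unknown option {option}")


def containers(option):
    """Return the distinct containers that the option creates."""
    return sorted({container_key(path, option) for path in LEAF_PATHS})


if __name__ == "__main__":
    for option in (1, 2, 3):
        print(option, containers(option))
```

Option 2 merges the two occurrences of C into one container, while option 3 keeps A/C and A/B/C apart, which is where its extra memory goes.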

Encoding text contd.
There was not much difference in compression ratio between options 2 and 3. Option 2, "a single container for each element", was chosen because it consumes less memory. As the text is character data, we compress it with arithmetic coding, using an order-4 adaptive model.
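
What "order-4 adaptive" means can be shown with a small model that predicts each character from the four characters before it and updates its counts as it goes. The sketch computes the ideal code length an arithmetic coder would reach with that model. It is an assumption-laden illustration: the slides do not describe XAUST's model in detail, add-one smoothing over a 256-symbol alphabet stands in for whatever estimator the tool uses, and `ideal_bits` is a made-up name.

```python
import math
from collections import defaultdict


def ideal_bits(text, order=4, alphabet_size=256):
    """Ideal code length in bits under an adaptive order-k model with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - order):i]  # up to `order` preceding characters
        seen = counts[context]
        total = sum(seen.values())
        probability = (seen[ch] + 1) / (total + alphabet_size)
        bits += -math.log2(probability)
        seen[ch] += 1  # adapt: the decoder can make the same update
    return bits


if __name__ == "__main__":
    text = "<name>Hariharan</name>" * 500
    print(round(ideal_bits(text, order=0)))
    print(round(ideal_bits(text, order=4)))
```

On repetitive text the order-4 model needs fewer bits than the order-0 model, because most characters become nearly certain once their context has been seen a few times.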

Implementation
An automaton is generated for each element. A stack is used for storing the current state of each ancestor element.
  When encountering   Action taken
  Open tag            Push the current state onto the stack. Start the automaton of the child element.
  Close tag           Pop the current state from the stack. Make the state transition.
  Text                Encode the text.
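
The table above can be turned into a small event loop. The sketch below is a simplified reading of the slide, not the generated code that XAUST produces: the automata are written by hand with assumed state numbers, text is collected into one list per element instead of being arithmetic coded, and the names `AUTOMATA`, `LEAF`, `automaton` and `compress` are hypothetical.

```python
from collections import defaultdict

END = "END"  # leaving an element (implicit transition to the parent)

# element -> state -> {symbol: next state}; the state numbering is an assumption
AUTOMATA = {
    "addressBook": {1: {"card": 1, END: None}},
    "card": {
        1: {"name": 3, "givenName": 2},
        2: {"familyName": 3},
        3: {"email": 4},
        4: {"note": 5, END: None},
        5: {END: None},
    },
}
LEAF = {1: {END: None}}  # (#PCDATA) elements contain no child tags


def automaton(tag):
    return AUTOMATA.get(tag, LEAF)


def compress(events):
    """Process ('open', tag), ('close', tag) and ('text', value) events."""
    stack = []                    # (element, state) of every ancestor
    element, state = None, None   # automaton that is currently running
    symbols, containers = [], defaultdict(list)
    for kind, value in events:
        if kind == "open":
            if element is not None:
                choices = automaton(element)[state]
                if value not in choices:
                    raise ValueError(f"<{value}> not allowed inside <{element}>")
                if len(choices) > 1:          # only real choices are emitted
                    symbols.append(value)
                stack.append((element, state))  # push the current state
            element, state = value, 1           # start the child automaton
        elif kind == "close":
            choices = automaton(element)[state]
            if END not in choices:
                raise ValueError(f"<{element}> closed too early")
            if len(choices) > 1:
                symbols.append(END)
            child = element
            if stack:
                element, state = stack.pop()               # pop the parent state
                state = automaton(element)[state][child]   # make the transition
            else:
                element, state = None, None
        else:
            containers[element].append(value)  # one text container per element
    return symbols, dict(containers)


events = [
    ("open", "addressBook"), ("open", "card"),
    ("open", "givenName"), ("text", "John"), ("close", "givenName"),
    ("open", "familyName"), ("text", "Smith"), ("close", "familyName"),
    ("open", "email"), ("text", "js@example.com"), ("close", "email"),
    ("close", "card"), ("close", "addressBook"),
]

if __name__ == "__main__":
    print(compress(events))
```

For the sample events, six open tags and six close tags reduce to four emitted symbols; the rest are implied by the automata.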

Experiments
We conducted our experiments on XMark, DBLP, UniProt, Michigan and XOO7 documents. XAUST is compared with XMill, XMLPPM and gzip. XMLPPM ran out of memory for the UniProt and Michigan documents. XMill, XMLPPM and gzip are each better than XAUST for one document. Compression ratio is defined as the ratio between the size of the compressed document and the size of the original document, so a smaller ratio means better compression.
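
As a quick illustration of the compression ratio defined on this slide, the sketch below measures it for gzip (one of the baselines) on a small synthetic document. The sample data and the function name `compression_ratio` are made up for illustration; the numbers it prints say nothing about the experiments reported here.

```python
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Size of the compressed document divided by the size of the original."""
    return len(compressed) / len(original)


# Synthetic, highly repetitive document (illustrative only).
record = b"<student id='1'><name>Hariharan</name></student>"
document = b"<root>" + record * 200 + b"</root>"
ratio = compression_ratio(document, gzip.compress(document))

if __name__ == "__main__":
    print(f"{ratio:.3f}")
```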

Document size
  Name      Size (in MB)
  Auction   113
  Dblp      253
  Uniprot   1070
  Michigan  495
  XOO7      128

[Figure: Compression ratio of document]

[Figure: Compression ratio of tags]

Conclusion and Future Work
The results amply justify the utility of the DTD. Large documents are compressed without running out of memory. The only requirement XAUST imposes is that a DTD be present. Future work will concentrate on querying the compressed document.

References
- Hartmut Liefke, Dan Suciu: XMill: An efficient compressor for XML data. Proceedings of ACM SIGMOD, 2000
- Mark Nelson: Arithmetic Coding. Dr. Dobb's Journal. http://dogma.net/markn/articles/arith/part1.htm
- UniProt: http://www.ebi.uniprot.org
- Michigan: http://www.eecs.umich.edu/db/mbench
- XOO7: http://www.comp.nus.edu.sg/ebh/XOO7.html
- XMark: http://monetdb.cwi.nl/xml/generator.html
- DBLP: http://www.informatik.uni-trier.de/~ley/db

Thank you