TREECHOP: A Tree-based Query-able Compressor for XML
Gregory Leighton, Tomasz Müldner, James Diamond
Acadia University
June 6, 2005



Outline
- XML
- TREECHOP
  - Compression Strategy
  - Decompression Strategy
  - Querying Strategy
- Experimental Results
- Conclusions

Extensible Markup Language (XML)
What is it?
- A standard for semi-structured data representation, introduced in 1998
- Data is surrounded by markup tokens (elements and attributes) used to indicate semantic meaning
Characteristics?
- Verbose (often 5 to 10 times larger than alternative formats like CSV)
- Lots of repetition, so plenty of opportunities for data compression

Example XML Document
[Figure: sample document with callouts marking the root element, an attribute, a data value, and a comment]

TREECHOP: Compression Strategy
Parsing splits the document into three segments:
- Prologue: stores text occurring before the document's root element
- Document Tree: contains all document contents between and including the root element's start and end tags
- Epilogue: stores text occurring after the document's root element
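The three-way split can be illustrated with a short Python sketch. This is not TREECHOP's own parser: it assumes a UTF-8 document, uses the standard `xml.parsers.expat` module to locate the byte offsets of the root element, and the function name `split_document` is invented for this example. It also assumes the root element is closed by a regular end tag (or by an empty-element tag with no `>` inside attribute values).

```python
import xml.parsers.expat


def split_document(data: bytes):
    """Return (prologue, document_tree, epilogue) as byte strings."""
    parser = xml.parsers.expat.ParserCreate()
    depth = 0
    span = {}

    def on_start(name, attrs):
        nonlocal depth
        if depth == 0:
            # Offset of the '<' that opens the root element's start tag.
            span["start"] = parser.CurrentByteIndex
        depth += 1

    def on_end(name):
        nonlocal depth
        depth -= 1
        if depth == 0:
            # The event position is the start of the closing tag; the
            # document tree ends just past the next '>'.
            span["end"] = data.index(b">", parser.CurrentByteIndex) + 1

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return data[:span["start"]], data[span["start"]:span["end"]], data[span["end"]:]
```

Joining the three returned segments reproduces the input byte for byte, which is the property a lossless compressor needs from this step.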

Example XML Document
[Figure: the example document divided into Prologue, Document Tree, and Epilogue]

Document Tree
- Root node corresponds to the document's root element
- Character data segments are represented using leaf nodes
- XML markup is represented using non-leaf nodes; there are 5 types of non-leaf nodes:
  - Element, attribute, CDATA, comment, processing instruction

Document Tree Generation
1. Get next token from XML parser
2. Construct tree node from token
3. Write tree node to compression stream
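A minimal Python sketch of this token-to-node loop follows. It is an illustration under assumptions, not the authors' code: it uses the standard `xml.parsers.expat` event parser, represents each node as a `(kind, value)` tuple, and drops whitespace-only character data for brevity (a lossless compressor would have to keep it). The names `stream_nodes` and `write` are hypothetical.

```python
import xml.parsers.expat


def stream_nodes(xml_bytes: bytes, write) -> None:
    """Parse incrementally and hand each node to `write` as soon as it is built."""
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous character data in one call

    def on_start(name, attrs):
        write(("element", name))
        for attr_name, value in attrs.items():
            write(("attribute", attr_name))
            write(("text", value))  # attribute value becomes a leaf node

    def on_text(data):
        if data.strip():  # simplification: skip whitespace-only text
            write(("text", data))

    def on_comment(data):
        write(("comment", data))

    def on_pi(target, data):
        write(("pi", f"{target} {data}"))

    parser.StartElementHandler = on_start
    parser.CharacterDataHandler = on_text
    parser.CommentHandler = on_comment
    parser.ProcessingInstructionHandler = on_pi
    parser.Parse(xml_bytes, True)
```

Because every node is passed to `write` inside the parser callback, nothing larger than the current token has to be held in memory.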

Document Tree Nodes
Each node in the tree has an associated label value, L:
- Element node → name of the element
- Attribute node → [prefix symbol missing] + name of the attribute
- Comment, CDATA, processing instruction nodes → all text between delimiting section markers
The path for a node v_n is /L_1/L_2/…/L_n, where the route connecting the root node v_1 with v_n consists of nodes v_1, v_2, …, v_n and L_i is the label for node v_i.
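The path definition can be made concrete with a small Python sketch that keeps a stack of labels while scanning a document. It covers element nodes only and uses the standard `xml.parsers.expat` parser; the function name `element_paths` is invented for this example.

```python
import xml.parsers.expat


def element_paths(xml_bytes: bytes) -> list:
    """Return the path /L_1/L_2/.../L_n of every element, in document order."""
    stack = []   # labels L_1 .. L_n of the currently open elements
    paths = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        stack.append(name)
        paths.append("/" + "/".join(stack))

    def on_end(name):
        stack.pop()

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(xml_bytes, True)
    return paths
```

Two elements with the same name and the same ancestors yield the same path string, which is what lets a compressor treat them as one class of node.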

Codeword Generation
- A binary codeword is assigned to each non-leaf node, based on the node's path
  - Multiple nodes with an identical path are assigned the same codeword
- The codeword is used during decompression and querying operations to identify the value and type of each node

Codeword Generation
The codeword C(v) assigned to a non-leaf node v with parent node p is formed by the concatenation of three codes:
- C(p): the codeword assigned to p
- G(v): Golomb code assigned to v based on its ordering relative to p
- T(v): a sequence of 3 bits used to indicate node type
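The concatenation C(v) = C(p) G(v) T(v) can be sketched in Python as below. Several details are assumptions made only for illustration, since the slides do not give them: the Golomb parameter `m = 2`, the 0-based ordinal of a node among the distinct child paths of its parent, an empty parent codeword for the root, and the particular 3-bit values in `TYPE_BITS`.

```python
# Hypothetical 3-bit type codes; the slides only say that T(v) is 3 bits long.
TYPE_BITS = {
    "element": "000",
    "attribute": "001",
    "cdata": "010",
    "comment": "011",
    "pi": "100",
}


def golomb(n: int, m: int = 2) -> str:
    """Golomb code of n >= 0: unary quotient, then truncated-binary remainder."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"
    if m == 1:
        return bits                      # no remainder bits needed
    b = (m - 1).bit_length()             # ceil(log2(m))
    cutoff = (1 << b) - m
    if r < cutoff:
        return bits + format(r, f"0{b - 1}b")
    return bits + format(r + cutoff, f"0{b}b")


def codeword(parent_code: str, ordinal: int, node_type: str, m: int = 2) -> str:
    """C(v) = C(p) + G(v) + T(v), returned as a string of '0' and '1'."""
    return parent_code + golomb(ordinal, m) + TYPE_BITS[node_type]


root = codeword("", 0, "element")            # e.g. /PurchaseOrder
date = codeword(root, 0, "element")          # e.g. /PurchaseOrder/Date
customer = codeword(root, 1, "element")      # e.g. /PurchaseOrder/CustomerID
```

Because each codeword begins with its parent's codeword, nodes deeper in the tree carry longer codewords, and a node's ancestry is visible in its prefix.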

Example XML Document

Example Document Tree

Codeword Assignment
Node paths that receive a codeword C(v):
- /PurchaseOrder
- /PurchaseOrder/Date
- /PurchaseOrder/CustomerID
- /PurchaseOrder/Order
- /PurchaseOrder/Order/Item
- /PurchaseOrder/Order/Item/ProductNo
- /PurchaseOrder/Order/Item/Quantity
Parts of each codeword:
- C(p): portion inherited from parent node
- G(v): portion assigned based on Golomb code
- T(v): portion used to indicate node type

TREECHOP: Writing the Tree
- The encoded tree is written to the compression stream in depth-first order; gzip is applied to further compress the encoded tree
- Non-leaf nodes are written as a 3-tuple (L, C, D):
  - L is a byte indicating the bit length of the codeword
  - C is a sequence of ⌈L / 8⌉ bytes containing the codeword
  - D is the node's label (e.g. element/attribute name); reserved byte values are used to signal the beginning/end of a sequence of raw character data
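A Python sketch of the (L, C, D) layout is given below. The slides do not say which byte values are reserved or how the final partial byte of C is padded, so the markers `START = 0x02` and `END = 0x03` and the left-aligned, zero-padded packing are assumptions chosen for this example; the function names are likewise hypothetical.

```python
START, END = b"\x02", b"\x03"   # hypothetical reserved delimiter bytes


def pack_codeword(bits: str) -> bytes:
    """Return L (one byte, bit length) followed by C (ceil(L / 8) bytes)."""
    if not 1 <= len(bits) <= 255:
        raise ValueError("codeword length must fit in one byte")
    n_bytes = (len(bits) + 7) // 8                 # ceil(L / 8)
    padded = bits.ljust(n_bytes * 8, "0")          # zero-pad the last byte
    return bytes([len(bits)]) + int(padded, 2).to_bytes(n_bytes, "big")


def encode_node(bits: str, label: str) -> bytes:
    """Serialize a non-leaf node as the 3-tuple (L, C, D)."""
    return pack_codeword(bits) + START + label.encode("utf-8") + END
```

For example, a 10-bit codeword occupies two bytes of C, so `encode_node("0000001000", "CustomerID")` produces one length byte, two codeword bytes, and the delimited label.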

TREECHOP: Writing the Tree
- On second and subsequent occurrences of a particular codeword, only the 2-tuple (L, C) is written (the decoder is able to infer the associated D)
- Leaf nodes are transmitted in the same manner as the D value for non-leaf nodes
- Each node encoding is transmitted immediately after node construction, which avoids the need to build the entire tree in memory
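The first-occurrence rule can be sketched in Python as follows. The input format (a sequence of `(bits, text)` pairs in depth-first order, with `bits` set to `None` for leaf nodes), the delimiter bytes, and the function names are all assumptions for illustration; `gzip.compress` from the standard library stands in for the final gzip pass.

```python
import gzip

START, END = b"\x02", b"\x03"   # hypothetical reserved delimiter bytes


def encode_stream(nodes) -> bytes:
    """Encode (bits, text) pairs; bits is None for leaf (character data) nodes."""
    seen = set()          # codewords whose label D has already been written
    out = bytearray()
    for bits, text in nodes:
        if bits is None:  # leaf node: written like a D value
            out += START + text.encode("utf-8") + END
            continue
        n_bytes = (len(bits) + 7) // 8
        out.append(len(bits))                                        # L
        out += int(bits.ljust(n_bytes * 8, "0"), 2).to_bytes(n_bytes, "big")  # C
        if bits not in seen:                                         # D, first time only
            seen.add(bits)
            out += START + text.encode("utf-8") + END
    return bytes(out)


def compress_stream(nodes) -> bytes:
    """Apply gzip on top of the encoded tree."""
    return gzip.compress(encode_stream(nodes))
```

In this sketch a repeated element costs only its length byte and codeword bytes, so the saving grows with the length of the label being omitted.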

TREECHOP: Decompression Strategy
The decoder operates by reading node data from the compression stream. For each non-leaf node:
1. Determine the D value
2. Determine the node type
3. Surround D with XML syntax appropriate to the node type and immediately emit it to the decompression stream
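Steps 2 and 3 can be sketched in Python. Since C(v) ends with the 3-bit type code T(v), the node type can be read from the last three bits of the codeword. The bit values in `TYPE_NAMES` are hypothetical, and the sketch is deliberately partial: attribute nodes and element end tags need extra decoder state (the start tag must stay open until its attributes are written) and are not shown.

```python
# Hypothetical mapping from the 3-bit suffix T(v) to a node type.
TYPE_NAMES = {"000": "element", "010": "cdata", "011": "comment", "100": "pi"}

# XML syntax wrapped around the label D for each supported node type.
TEMPLATES = {
    "element": "<{}>",          # simplification: element without attributes
    "cdata": "<![CDATA[{}]]>",
    "comment": "<!--{}-->",
    "pi": "<?{}?>",
}


def render(bits: str, label: str) -> str:
    """Return the XML text to emit for a non-leaf node with codeword `bits`."""
    node_type = TYPE_NAMES[bits[-3:]]   # step 2: type from the T(v) suffix
    return TEMPLATES[node_type].format(label)   # step 3: wrap D in XML syntax
```

Each call produces output from one node alone, which matches the slide's point that text is emitted immediately rather than after the whole tree is rebuilt.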

TREECHOP: Querying Strategy
- An individual query handler is registered with the decoder for each query
- A single scan of the compression stream is carried out, using a stack to keep track of the current path
- When the query predicate path is matched, the current codeword is recorded and the remainder of the compression stream is scanned for future occurrences
- Each time a query match is encountered, the associated D value is extracted from the compression stream and passed to the query handler for processing
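The single-scan strategy can be sketched in Python over an already-decoded record stream. The record shapes are assumptions for illustration: `("open", bits, label)` for a non-leaf node (with `label` set to `None` on repeat occurrences), `("text", value)` for a leaf, and `("close",)` when a node ends. Explicit close records are a simplification of this sketch, not something the slides describe.

```python
def run_query(records, query_path: str, handler) -> None:
    """Scan once; pass the text under every node matching query_path to handler."""
    labels = {}     # codeword -> label, learned on first occurrence
    stack = []      # codewords of the currently open nodes
    target = None   # codeword recorded once the query path first matches

    for record in records:
        kind = record[0]
        if kind == "open":
            _, bits, label = record
            if label is not None:
                labels[bits] = label
            stack.append(bits)
            if target is None:
                path = "/" + "/".join(labels[b] for b in stack)
                if path == query_path:
                    target = bits   # later matches need only a codeword compare
        elif kind == "text":
            if stack and stack[-1] == target:
                handler(record[1])
        else:  # "close"
            stack.pop()
```

After the first match, the path string is never rebuilt: every later hit is found by comparing the current codeword with the recorded one.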

Experimental Results: Compression Rates
Table columns: File, Size (KB), Elements, Attributes, Data
Test files: (A) Baseball, (B) Macbeth, (C) 150emp, (D) emp

Experimental Results: Compression/Decompression Speed
Distance between sender/receiver: 20 km / 12 miles

Experimental Results: Querying
Distance between sender/receiver: 20 km / 12 miles

Conclusions
- TREECHOP compresses at rates comparable to gzip, while also providing query-friendly annotations to the compression stream
- Using TREECHOP querying in place of alternative methods like XSLT yields a significant performance advantage on medium- to large-sized XML documents; the advantage increases with document size