
Introduction XML stands for eXtensible Markup Language. It is designed to transport and store data, not to display it. XML is similar to HTML, but its tags are not predefined; they are defined by users. XML is a W3C Recommendation. The main idea of this project is to compress well-formed XML files that an application generates from database queries.

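XML file structure The XML file consists of an XML file head followed by a main XML element. The main element contains one XML element per query row, and each row element contains one XML element per query column, which holds that column's data.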

Algorithm The algorithm takes advantage of the well-defined structure of the XML files and of the frequency with which values repeat within a row's columns. This is the key idea of the algorithm. It uses compression strategies similar to a static dictionary: XML tags and "DataKeys" are replaced by unused ASCII characters.

Description Compression Algorithm The file is processed in two (2) phases. Phase One identifies the XML tags, the available ASCII characters, and the DataKeys. DataKeys are sorted by the following rule: Length(DataKey) * frequency - (Length(DataKey) + frequency). Any DataKey beyond the number of available characters is discarded. Examples:
– Key len = 20, frequency = 10: 30 characters instead of 200, a saving of 170
– Key len = 15, frequency = 10: 25 characters instead of 150, a saving of 125
– Key len = 30, frequency = 5: 35 characters instead of 150, a saving of 115
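The ranking rule above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the names `savings` and `select_data_keys` are hypothetical, and the sketch assumes that DataKeys with equal scores keep their order of first appearance.

```python
from collections import Counter


def savings(length, frequency):
    """Score from the slide: Length(DataKey) * frequency - (Length(DataKey) + frequency)."""
    return length * frequency - (length + frequency)


def select_data_keys(values, available):
    """Rank the distinct column values by score and keep at most `available` of them.

    `values` is the list of values found in the chosen column (one per row);
    `available` is the number of unused ASCII characters left for DataKeys.
    """
    counts = Counter(values)
    ranked = sorted(
        counts,
        key=lambda key: savings(len(key), counts[key]),
        reverse=True,
    )
    return ranked[:available]


# The three examples from the slide.
print(savings(20, 10), savings(15, 10), savings(30, 5))  # 170 125 115
```

For instance, `select_data_keys(["USA", "UK", "USA", "UK", "Colombia"], 2)` keeps `"USA"` and `"UK"` and discards `"Colombia"`, because only two characters are available.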

Description Compression Algorithm Phase Two reads the XML file again and creates a new file. The new file starts with a header, built from the information gathered in Phase One and detailed later, which is needed to reconstruct the XML file. The rest of the new file is the original content with Tags/DataKeys replaced by the available ASCII characters.

Description Compression Algorithm Rules to replace Tags/DataKeys:
– Main tag: skipped.
– Row tag: an ASCII char is assigned.
– Column tag: an ASCII char is assigned.
– If the column data is a DataKey:
    – if an ASCII char is assigned to that DataKey, write just the assigned ASCII char;
    – else write the assigned column char + column data.
– Else write the assigned column char + column data.
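The replacement rules above can be sketched as follows for a single row. This is a minimal illustration under stated assumptions, not the original implementation: `encode_row` and its parameters are hypothetical names, column indexes are zero-based, the `+` character for the COUNTRY column is made up (the slides do not show it), and no separator is written between fields.

```python
def encode_row(row, row_char, column_chars, key_columns, key_chars):
    """Encode one query row following the replacement rules.

    row          -- list of column values, in column order
    row_char     -- ASCII char assigned to the row tag
    column_chars -- ASCII char assigned to each column tag, by column index
    key_columns  -- set of column indexes declared as DataKey columns
    key_chars    -- dict mapping a DataKey value to its assigned ASCII char
    """
    out = [row_char]
    for index, value in enumerate(row):
        if index in key_columns and value in key_chars:
            # DataKey with an assigned char: the char alone stands for tag + data.
            out.append(key_chars[value])
        else:
            # Everything else: assigned column char followed by the column data.
            out.append(column_chars[index] + value)
    return "".join(out)


row = ["Empire Burlesque", "Bob Dylan", "USA", "Columbia", "10.90", "1985"]
encoded = encode_row(
    row,
    row_char="@",
    column_chars="&!+*$#",   # "+" for COUNTRY is an assumption
    key_columns={2},
    key_chars={"USA": "%", "UK": "~", "Colombia": "^"},
)
print(encoded)  # @&Empire Burlesque!Bob Dylan%*Columbia$10.90#1985
```

A DataKey value that did not receive a character, because the available characters ran out, falls through to the last rule and is written as column char plus data.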

Description Decompression Algorithm Read the file header:
– The first four (4) characters mean:
    1. Number of BitWise characters (used ASCII chars).
    2. First used ASCII char.
    3. Number of Element tags.
    4. Number of DataKey sets.
– According to char 4, read the Col/Num pairs.
– According to char 1, read the BitWise characters.
– According to char 3, read the Element strings.
– According to the total Num from the pairs, read the DataKeys.
– Read the rest of the file, replacing each assigned ASCII char.
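The reading order above can be sketched as a small header parser. The byte layout is an assumption, not something the slides confirm: each count is taken to be one byte, each Col/Num pair two bytes, and each Element or DataKey string is taken to end with a zero byte. The function name `parse_header` and the returned dictionary keys are hypothetical.

```python
def parse_header(data: bytes):
    """Parse a header laid out in the order described in the slides (assumed layout)."""
    num_bitwise, start_ascii, num_element, data_key_num = data[:4]
    pos = 4

    # Char 4 tells how many (column number, number of DataKeys) pairs follow.
    pairs = []
    for _ in range(data_key_num):
        pairs.append((data[pos], data[pos + 1]))
        pos += 2

    # Char 1 tells how many BitWise characters follow.
    bitmap = data[pos:pos + num_bitwise]
    pos += num_bitwise

    def read_strings(count, pos):
        """Read `count` zero-terminated ASCII strings starting at `pos`."""
        strings = []
        for _ in range(count):
            end = data.index(0, pos)
            strings.append(data[pos:end].decode("ascii"))
            pos = end + 1
        return strings, pos

    # Char 3 gives the number of Element strings; the pairs give the DataKey total.
    elements, pos = read_strings(num_element, pos)
    data_keys, pos = read_strings(sum(num for _, num in pairs), pos)
    pos += 1  # skip the closing null character

    return {
        "start_ascii": start_ascii,
        "pairs": pairs,
        "bitmap": bitmap,
        "elements": elements,
        "data_keys": data_keys,
        "header_length": pos,
    }
```

Everything after `header_length` bytes is the compressed body, which is then expanded by replacing each assigned character.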

Application Syntax
xmlzip [-c filename.xml [-k column_1 … column_n]] | [-d filename.xzp]
Where
-c: compress
-k: column numbers to be used as DataKeys
-d: decompress

HEADER Converted File

Field         Value(s)
NUMBITWISE    12
STARTASCII    32
NUMELEMENT    8
DATAKEYNUM    1
COLUMNNUMB    3
SUBDATAKEY    3
BITWISECHR    160 (10100000), 7 (00000111), 231 (11100111), 207 (11001111), 127 (01111111), 255 (11111111), 188 (10111100), 64 (01000000), 125 (01111101), 223 (11011111), 254 (11111110), 192 (11000000)
ELEMENTSTR    CATALOG, CD, TITLE, ARTIST, COUNTRY, COMPANY, PRICE, YEAR
DATAKEYSTR    USA, UK, Colombia
NULLCHARAC    0
…

The header length is proportional to the range of characters found in the XML file, to the XML file's Elements, and to the DataKeys found in the XML file:

$$H = 4 + \text{DATAKEYNUM} \cdot 2 + \text{NUMBITWISE} + \sum_{i=1}^{\text{NUMELEMENT}} \left[\text{length}(\text{ELEMENTSTR}_i) + 1\right] + \sum_{j=1}^{\text{SUBDATAKEY}} \left[\text{length}(\text{DATAKEYSTR}_j) + 1\right] + 1$$

In this case, the file HEADER length is: H = 4 + 2 * 1 + 12 + (8 + 3 + 6 + 7 + 8 + 8 + 6 + 5) + (4 + 3 + 9) + 1 = 86
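The header-length formula and the BitWise characters can be checked with a short sketch. The function names are hypothetical. The bit order is an assumption inferred from the example values: each BitWise byte is read most significant bit first, starting at STARTASCII, and a set bit is taken to mean that the character occurs in the XML file, so cleared bits mark characters that are free for assignment.

```python
def header_length(num_bitwise, data_key_num, elements, data_keys):
    """Header length H, term by term as in the formula."""
    return (
        4
        + data_key_num * 2
        + num_bitwise
        + sum(len(element) + 1 for element in elements)
        + sum(len(key) + 1 for key in data_keys)
        + 1
    )


def used_characters(bitmap, start_ascii):
    """Characters flagged as used, assuming MSB-first bits and 1 = used."""
    used = set()
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (0x80 >> bit):
                used.add(chr(start_ascii + byte_index * 8 + bit))
    return used


ELEMENTS = ["CATALOG", "CD", "TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR"]
DATA_KEYS = ["USA", "UK", "Colombia"]
BITMAP = [160, 7, 231, 207, 127, 255, 188, 64, 125, 223, 254, 192]

print(header_length(12, 1, ELEMENTS, DATA_KEYS))  # 86

used = used_characters(BITMAP, 32)
print(sorted(c for c in "@&!%*$#~^" if c in used))  # []
```

Under that reading, none of the characters that appear as replacements in the converted file (`@ & ! % * $ # ~ ^`) is flagged as used, which is consistent with them being assigned to Tags and DataKeys.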

Original data (XML tags not shown):
Empire Burlesque   Bob Dylan         USA        Columbia      10.90   1985
Hide your heart    Bonnie Tyler      UK         CBS Records   9.90    1988
Thriller           Michael Jackson   USA        Columbia      11.90   1985
Love Songs         Bee Gee           UK         Records       12.00   1980
Oral Fixation      Shaquira          Colombia   Epic          18.70   2006

Converted file:
HEADER
@ &Empire Burlesque !Bob Dylan % *Columbia $10.90 #1985
@ &Hide your heart !Bonnie Tyler ~ *CBS Records $9.90 #1988
@ &Thriller !Michael Jackson % *Columbia $11.90 #1985
@ &Love Songs !Bee Gee ~ *Records $12.00 #1980
@ &Oral Fixation !Shaquira ^ *Epic $18.70 #2006
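Going the other way, the converted rows can be expanded back into XML by scanning for the assigned characters. The sketch below is illustrative only: `decode_rows` and its parameters are hypothetical, the character-to-tag mappings are passed in directly rather than rebuilt from the header, the body is assumed to contain no separator whitespace between fields, and XML escaping is not handled.

```python
def decode_rows(body, row_char, row_tag, column_tags, key_values):
    """Expand a compressed body into one XML string per row.

    column_tags -- dict: assigned char -> column tag name
    key_values  -- dict: assigned char -> (column tag name, DataKey value)
    """
    rows = []        # one list of [tag, text] pairs per row
    current = None   # the column whose data is currently being collected
    for ch in body:
        if ch == row_char:
            rows.append([])
            current = None
        elif not rows:
            continue  # ignore anything before the first row marker
        elif ch in key_values:
            # A DataKey char stands for both the column tag and its data.
            rows[-1].append(list(key_values[ch]))
            current = None
        elif ch in column_tags:
            current = [column_tags[ch], ""]
            rows[-1].append(current)
        elif current is not None:
            current[1] += ch
    return [
        f"<{row_tag}>"
        + "".join(f"<{tag}>{text}</{tag}>" for tag, text in fields)
        + f"</{row_tag}>"
        for fields in rows
    ]


COLUMN_TAGS = {"&": "TITLE", "!": "ARTIST", "*": "COMPANY", "$": "PRICE", "#": "YEAR"}
KEY_VALUES = {"%": ("COUNTRY", "USA"), "~": ("COUNTRY", "UK"), "^": ("COUNTRY", "Colombia")}

body = "@&Empire Burlesque!Bob Dylan%*Columbia$10.90#1985"
print(decode_rows(body, "@", "CD", COLUMN_TAGS, KEY_VALUES)[0])
```

Because the assigned characters never occur in the original file, a single left-to-right pass is enough to tell markers from data.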

Next The next step is to make the algorithm generic, in particular the feature that takes advantage of column frequency. It could be driven by tag name instead of column number. I did not implement this for lack of time, but it would avoid conflicts caused by column order. XML attribute recognition also needs to be implemented. It is almost done, but I stopped because of time constraints. It would be useful to let the user specify, through parameters, which attributes are taken into account. One case to handle is that element tags and attribute names can share the same name even though they are different kinds of data. Last but not least, the implementation of a modified PPM algorithm should be completed. The first task would be to add to the HEADER those DataKeys that exceed the available ASCII chars and satisfy the condition Length(DataKey) > largest context and frequency > 1, so that they can be added to a "temporary" count array in which the size of the DataKey does not matter.
