
1
Lecture on Information Knowledge Network (2011/11/29): "Information retrieval and pattern matching"
Takuya KIDA, Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University

2
The 7th: Development of a new compression method for pattern matching ~ Improving VF codes ~
– Which compression methods are suitable?
– VF code
– STVF code
– Improvement by allowing incomplete internal nodes
– Improvement by iterative learning
– Conclusion

3
Which compression methods are suitable?
Key features for fast compressed pattern matching (CPM):
– having clear code boundaries: end-tagged codes, byte-oriented codes, or fixed-length codes
– using a static and compact dictionary: methods like Huffman coding
– achieving high compression ratios, to reduce the amount of data to be processed
These requirements point to VF codes (variable-length-to-fixed-length codes): the Tunstall code (Tunstall 1967) and the AIVF code (Yamamoto & Yokoo 2001).

4
VF codes and the others
Codes can be classified by whether the source block and the codeword have fixed or variable length:

Input text (source symbol)  Compressed text (codeword): Fixed length   Variable length
Fixed length                FF code                                    FV code (e.g., Huffman code)
Variable length             VF code (e.g., Tunstall code)              VV code (e.g., the LZ family)

VV codes are the mainstream in terms of compression ratio. It is difficult for VF codes to achieve high compression ratios since the codewords have fixed length; as a result, existing VF codes have seen no practical use.

5
VF coding using a parse tree
Each leaf of the parse tree corresponds to a string and is assigned a number, which serves as its codeword. Repeat the following:
1. Read symbols from the input text one by one, walking down the parse tree.
2. Parse off a block when the traversal reaches a leaf.
3. Encode the parsed block as the number of that leaf.
Example (figure omitted): with a parse tree T whose leaves are numbered 0-8, the input text abbaabbaaacacc is coded as 3 5 1 5 0 6 8.
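The encoding loop above can be sketched in a few lines of Python. The parse tree here is a hypothetical one, represented as a plain dict from leaf strings to leaf numbers; the tree is assumed complete, so the traversal always ends at a leaf:

```python
def vf_encode(text, tree, codelen):
    """Walk the parse tree symbol by symbol; whenever a leaf is
    reached, emit its number as a fixed-length binary codeword."""
    codes, block = [], ""
    for ch in text:            # 1. read symbols one by one
        block += ch
        if block in tree:      # 2. the traversal reached a leaf: parse here
            codes.append(format(tree[block], f"0{codelen}b"))  # 3. emit the leaf number
            block = ""
    return codes

# Hypothetical parse tree over {a, b, c}: root -> {a -> {aa, ab, ac}, b, c}
tree = {"aa": 0, "ab": 1, "ac": 2, "b": 3, "c": 4}
print(vf_encode("abbaac", tree, 3))   # ['001', '011', '000', '100']
```

Because the leaves of a parse tree are prefix-free, the membership test `block in tree` fires exactly once per parsed block.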

6
Tunstall code
B. P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Inst. Technol., Atlanta, GA, 1967.
It uses a complete k-ary parse tree for the source alphabet ∑ (|∑| = k). It is the optimal VF code for memoryless information sources, where the occurrence probability of each symbol is given and unchanged.
Example (figure omitted): Tunstall tree T_m with ∑ = {a, b, c}, k = 3, number of internal nodes m = 3, P(a) = 0.5, P(b) = 0.2, P(c) = 0.3. The tree has 7 leaves, so each leaf number is represented by a fixed-length binary code of length 3 (2^3 = 8). Coding the input S = abacaacbabcc with T_m:
S = ab・ac・aa・cb・ab・cc
Z = 001 010 000 101 001 110

7
How to construct the Tunstall tree
Construct the optimal tree T_m* that maximizes the average block length. Given an integer m ≥ 1 and the occurrence probabilities P(a) (a ∈ ∑), T_m* is constructed as follows:
1. Let T_1* be the initial parse tree: the complete k-ary tree of depth 1.
2. Repeat the following steps for i = 2, …, m:
   A) Choose the leaf v = v_i* whose probability is maximum among all leaves of the current parse tree T_{i-1}*.
   B) Graft T_1* onto v_i* to obtain T_i*.
Example (figure omitted): Tunstall tree T_m* with ∑ = {a, b, c}, k = 3, m = 4, P(a) = 0.5, P(b) = 0.2, P(c) = 0.3.
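The greedy construction can be sketched as follows; this is a plain re-implementation of the procedure above, not the authors' code. Run with the parameters of the previous slide's example (m = 3, i.e., two expansions after the depth-1 tree), it reproduces that slide's codebook and coded sequence:

```python
import heapq

def build_tunstall(probs, m):
    """Greedily expand the maximum-probability leaf m-1 times, starting
    from the depth-1 complete k-ary tree, so the final tree has m
    internal nodes. Returns a dict: leaf string -> binary codeword."""
    syms = sorted(probs)
    # heap of (-probability, leaf string): the most probable leaf pops first
    heap = [(-probs[s], s) for s in syms]
    heapq.heapify(heap)
    for _ in range(m - 1):
        p, w = heapq.heappop(heap)          # leaf with maximum probability
        for s in syms:                       # graft the depth-1 tree onto it
            heapq.heappush(heap, (p * probs[s], w + s))
    leaves = sorted(w for _, w in heap)      # fixed-length codes, lexicographic
    width = max(1, (len(leaves) - 1).bit_length())
    return {w: format(i, f"0{width}b") for i, w in enumerate(leaves)}

def encode(text, book):
    """Parse the text with the (prefix-free) leaf strings and emit codes."""
    out, i = [], 0
    while i < len(text):
        for w in book:                       # exactly one leaf matches here
            if text.startswith(w, i):
                out.append(book[w])
                i += len(w)
                break
    return out

book = build_tunstall({"a": 0.5, "b": 0.2, "c": 0.3}, m=3)
print(encode("abacaacbabcc", book))  # ['001', '010', '000', '101', '001', '110']
```

The two expansions pick the root's child a (0.5) and then c (0.3), giving the seven leaves aa, ab, ac, b, ca, cb, cc with codes 000-110, exactly as in the example.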

8
Basic idea
Utilize the suffix tree of the input text as a parse tree. The suffix tree [Weiner, 1973] for T carries complete statistical information about T.
– The suffix tree ST(T) is a dictionary tree for ANY substring of T. → We cannot use ST(T) as is, since it includes T itself!
– We have to prune the tree properly to obtain a parse tree, and the pruning must keep nodes whose frequencies are high.
– Note that the pruned suffix tree must be stored, since it is needed to decompress the encoded text.
Kida presented this at DCC 2009. Unfortunately, however, there was prior work that is very similar:
Klein, S. T. and Shapira, D., "Improved variable-to-fixed length codes," in SPIRE 2008, pp. 39-50, 2008.

9
STVF codes: parse tree by pruned suffix tree
Suffix tree:
– a compacted dictionary trie for all the substrings of the input text
– for an input text of length n, its size is O(n)
– using the suffix tree ST(T), any substring can be searched for in time linear in the length of the substring
– there are O(n) online construction algorithms: E. Ukkonen, "Constructing suffix trees on-line in linear time," Proc. of IFIP '92, pp. 484-492.
(Figure omitted: suffix tree ST(S) for S = abbcabcabbcabd, annotated with node frequencies.)
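For illustration, even a naive suffix trie (quadratic to build, unlike the O(n) compacted suffix tree of Ukkonen's algorithm) already provides the two operations the slides rely on: substring search in time linear in the pattern length, and an occurrence count at every node. This sketch is an assumption-laden stand-in for the real data structure:

```python
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0   # number of suffixes through this node = occurrences

def suffix_trie(s):
    """Naive O(n^2) suffix trie: insert every suffix character by
    character. A real suffix tree compacts unary paths and is built
    online in O(n) time (Ukkonen's algorithm)."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root

def occurrences(root, pattern):
    """Count occurrences of pattern, in time linear in len(pattern)."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return 0
        node = node.children[ch]
    return node.count

root = suffix_trie("abbcabcabbcabd")
print(occurrences(root, "ab"), occurrences(root, "b"))   # 4 6
```

The per-node counts are exactly the frequencies that the pruning step of the next slides maximizes.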

11
Pruning the suffix tree ST(S) yields the parse tree PT(S) (figure omitted): with codewords 000-111 assigned to its eight leaves, S = abbcab・cab・bcab・d is coded as 000 011 110 100.

12
Frequency-based pruning algorithm
1. Construct the suffix tree ST(T$).
2. Let T_1' be the initial candidate tree: the pruned suffix tree ST_{k+1}(T$) that consists of the root of ST(T$) and its children.
3. Choose the node v whose frequency (measured in ST(T$)) is highest among all leaves of T_i' = ST_{L_i}(T$). Let L_i be the total number of leaves in T_i', and let C_v be the number of children of v.
4. If L_i + C_v - 1 ≤ 2^l holds, add all the children of v to T_i' as new leaves, yielding the new candidate T_{i+1}'. If a child u of v is a leaf in ST(T$), however, chop off the label from v to u except for its first character.
5. Repeat steps 3 and 4 while T_i' can be extended.
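A toy version of this pruning can be sketched as follows. For brevity it counts substring frequencies directly from the text rather than reading them off a suffix tree, it skips the label-chopping of step 4, and it stops at the first leaf that cannot be expanded instead of trying the remaining ones, so it is only an approximation of steps 1-5:

```python
def build_parse_tree(text, codelen):
    """Greedy frequency-based pruning sketch: grow a parse tree from the
    depth-1 tree by always expanding the most frequent leaf, while the
    number of leaves stays within 2**codelen. Returns leaf -> number."""
    limit = 2 ** codelen
    alphabet = sorted(set(text))

    def freq(w):
        # occurrences of w in text (overlaps counted)
        return sum(text.startswith(w, i) for i in range(len(text)))

    leaves = list(alphabet)              # depth-1 complete tree
    while True:
        v = max(leaves, key=freq)        # most frequent leaf
        children = [v + a for a in alphabet if freq(v + a) > 0]
        if not children or len(leaves) + len(children) - 1 > limit:
            break                        # cannot expand within the budget
        leaves.remove(v)
        leaves.extend(children)
    return {w: i for i, w in enumerate(sorted(leaves))}

pt = build_parse_tree("abbcabcabbcabd", codelen=3)
print(sorted(pt))
```

On the running example S = abbcabcabbcabd with 3-bit codewords, the budget of 2^3 = 8 leaves is filled with frequent substrings such as bcab and cab.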

13
Experimental results (compression ratios)

Methods             E.coli   bible.txt  world192.txt  dazai.utf.txt  1000000.txt
Huffman code        25.00%   54.82%     63.03%        57.50%         59.61%
Tunstall code (8)   27.39%   72.70%     85.95%        100.00%       76.39%
Tunstall code (12)  26.47%   64.89%     77.61%        69.47%         68.45%
Tunstall code (16)  26.24%   61.55%     70.29%        70.98%         65.25%
STVF code (8)       25.09%   66.59%     80.76%        73.04%         74.25%
STVF code (12)      25.10%   50.25%     62.12%        52.99%         68.90%
STVF code (16)      28.90%   42.13%     49.93%        41.37%         78.99%

Figures in () indicate the codeword length in bits.

Text data      Size (bytes)  |∑|  Contents
E.coli         4638690       4    Complete genome of the E. coli bacterium
bible.txt      4047392       63   The King James version of the Bible
world192.txt   2473400       94   The CIA World Fact Book
dazai.utf.txt  7268943       141  The complete works of Osamu Dazai (UTF-8 encoded)
1000000.txt    1000000       26   An automatically generated random string

From the Canterbury Corpus (http://corpus.canterbury.ac.nz/) and J-TEXTS (http://www.j-texts.com/).

14
Compression ratio vs. codeword length (figure omitted)
The text used is bible.txt (the King James version of the Bible; 3.85MB).

15
Improvement by combining with a range coder
Compare compression ratios, compression times, and decompression times of:
– STVF coding
– Tunstall + range coder
– STVF coding + range coder
Data: English text (the King James Bible, 4MB, |Σ| = 63)
Environment: Intel Xeon 3.00GHz dual core CPU, 12GB memory, Red Hat Enterprise Linux ES Release 4
Codeword length: l = 8-16 bits

16
Results (codeword length vs. compression ratio) (figure omitted)

17
Results (codeword length vs. compression time) (figure omitted)

18
Results (codeword length vs. decompression time) (figure omitted)

19
Take a breath (photo: Sendai Dai-Kannon @ Daikanmitsu-ji, 2011.11.03)

21
Problem of the original STVF code
The bottleneck is the shape of the parse tree: even if we choose a node whose frequency is high, not all of its children have high frequencies. In STVF coding, all the children are added anyway, which creates useless leaves (figure omitted: e.g., a node with f = 1000 whose children have f = 500, 400, 3, 2, 1).

22
Improvement by allowing incomplete internal nodes
Adding nodes one by one in frequency order should do better: we choose only nodes whose frequencies are high enough, which leaves some internal nodes incomplete, and we allow assigning codewords to such incomplete internal nodes. The coding procedure is modified so that it encodes instantaneously (like the AIVF code [Yamamoto & Yokoo 2001]):
– Output a codeword when the traversal fails.
– Resume the traversal from the root, starting with the character that caused the failure.
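The modified coding procedure can be sketched as follows, with a hypothetical codebook. It assumes that every node of the parse tree, including incomplete internal ones, carries a codeword, and that every single character is in the tree, so the traversal can always restart:

```python
def encode_incomplete(text, book):
    """AIVF-style instantaneous coding: extend the current block while it
    is still a prefix of some dictionary string; when the traversal
    fails, emit the codeword of the block matched so far and restart
    from the character that caused the failure."""
    prefixes = {w[:i] for w in book for i in range(1, len(w) + 1)}
    out, block = [], ""
    for ch in text:
        if block + ch in prefixes:
            block += ch              # traversal continues down the tree
        else:
            out.append(book[block])  # traversal failed: output a codeword
            block = ch               # resume from the root with ch
    out.append(book[block])          # flush the final block
    return out

# Hypothetical parse tree: 'ab' is an incomplete internal node (its only
# child is 'abc'), yet it carries codeword 1 of its own.
book = {"a": 0, "ab": 1, "abc": 2, "b": 3, "c": 4}
print(encode_incomplete("ababcb", book))   # [1, 2, 3]
```

On "ababcb" the traversal matches ab, fails on the second a, restarts and matches abc, fails on b, and flushes b: codewords 1, 2, 3.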

23
Difference between the original and the improved method
Comparing the parse tree of the original STVF code with the improved parse tree (T = BABCABABBABCBAC, k = 3; figures omitted): in the one-by-one method, less frequent leaves are chopped off, so longer strings are likely to be added to the tree and tend to be assigned codewords.

24
Experiments
Text data: bible.txt (the Canterbury Corpus), 3.85MB.
Method: compare with the original STVF code at a codeword length of 16 bits.

Methods         Comp. time  Comp. ratio  Compressed PM time
Original STVF   6109 ms     42.1%        7.27 ms
One-by-one      6593 ms     34.2%        5.67 ms

On Intel Core 2 Duo T7700, 2GB RAM, Windows Vista, Visual C++ 2008.
Result: the one-by-one method improves the compression ratio by 18.7% and the compressed pattern matching speed by 22.2%.

25
Improvement with iterative learning
Not all substrings that occur frequently in the input text appear frequently in the coded sequence of blocks! Shall we construct the optimal parse tree, then?
⇒ We would have to choose the substrings that are actually used when encoding the text and enter them in the dictionary.
⇒ But the boundaries of the parsed blocks vary with the dictionary.
⇒ Which comes first, the chicken or the egg? The problem is as hard as NP-complete problems.
How shall we brush up a parse tree instead?
– Encode the text iteratively, and choose the nodes that prove useful to brush up the tree.

26
Idea of brushing up a parse tree (figure omitted)
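The slides give no pseudo-code for the brush-up step, so the following is only one plausible reading of the idea; the concrete details (greedy longest-match re-parsing, swapping the least-used multi-character entries for the most frequent concatenations of adjacent blocks) are this sketch's own assumptions:

```python
from collections import Counter

def parse(text, dictionary):
    """Greedy longest-match parse (every single character of the text is
    assumed to be in the dictionary, so parsing always succeeds)."""
    blocks, i = [], 0
    longest = max(map(len, dictionary))
    while i < len(text):
        for l in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + l] in dictionary:
                blocks.append(text[i:i + l])
                i += l
                break
    return blocks

def brush_up(text, dictionary, rounds=3):
    """Iteratively re-encode the text and let the dictionary adapt to how
    it is actually used: entries that are parsed rarely are swapped for
    frequent concatenations of adjacent blocks."""
    d = set(dictionary)
    for _ in range(rounds):
        blocks = parse(text, d)
        used = Counter(blocks)
        pairs = Counter(a + b for a, b in zip(blocks, blocks[1:]))
        # single characters are kept so that parsing never gets stuck
        removable = sorted((w for w in d if len(w) > 1),
                           key=lambda w: (used[w], w))
        for w, (p, n) in zip(removable, pairs.most_common()):
            if n > used[w] and p not in d:
                d.discard(w)
                d.add(p)
    return d

d = brush_up("abcabcabcabc", {"a", "b", "c", "ab", "ca", "bb"}, rounds=3)
print(len(parse("abcabcabcabc", d)))   # fewer blocks than with the initial dictionary
```

Each round re-parses the text with the current dictionary, so the block boundaries and the usage counts co-evolve, which is exactly the chicken-and-egg interplay the previous slide describes.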

27
Experiments

Texts          Size (bytes)  |Σ|  Contents
GBHTG119       81,173,787    4    DNA sequences
DBLP2003       90,510,236    97   XML data
Reuters-21578  18,805,335    103  English texts
Mainichi1991   78,911,178    256  Japanese texts (UTF-8)

We compared with BPEX, ETDC, SCDC, gzip, and bzip2 on the above text data. C++ compiled by g++ (GNU v3.4), on an Intel Xeon 3GHz with 12GB of RAM, running Red Hat Enterprise Linux ES Release 4.

28
Compression ratios (figure omitted)

29
Compression times (figure omitted)

30
Decompression times (figure omitted)

31
The 7th: summary
Key features for fast CPM:
– having clear code boundaries
– using a static and compact dictionary
– achieving high compression ratios
VF coding is a promising compression method!
– We developed a new VF code, named STVF code, which uses a pruned suffix tree as its parse tree.
– Improvement by allowing incomplete internal nodes.
– Improvement by iterative learning.
– The improved VF codes reach the level of state-of-the-art methods such as gzip and BPEX in compression ratio!
Future work:
– reach the level of bzip2
– develop efficient pattern matching algorithms on the improved STVF codes, e.g., an implementation of a BM-type algorithm on STVF codes
