
1. A Modified Burrows-Wheeler Transformation for Case-insensitive Search with Application to Suffix Array Compression
Kunihiko Sadakane
Department of Information Science, University of Tokyo

2. Promising Techniques

Block Sorting Compression [Burrows, Wheeler 94]
- faster than PPMs
- decoding is much faster
- performance comparable to PPMs

Suffix Array [Manber, Myers 93]
- search data structure
- can find any substring
- more memory efficient than suffix trees

We unify compression and search by using them.
Key: the Burrows-Wheeler Transformation (BWT)

3. Block Sorting Compression

The Burrows-Wheeler Transformation (BWT) permutes the symbols of the text according to the lexicographic order of the suffixes that follow them. The permuted text becomes more compressible.

4. Novel Feature of Block Sorting

- The BWT is defined by the suffix array (the sorted indexes of the suffixes).
- The suffix array can be recovered from the compressed text.
- So the suffix array can be compressed by Block Sorting!
- However, the recovered suffix array cannot be used for case-insensitive search.
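The recovery of the suffix array can be illustrated with a small sketch. It assumes the rotation-based BWT of the text AbcABC, whose transformed text is cCABAb, and it assumes that the row of the unrotated text in the sorted order (called `primary` here, a name chosen for illustration) is stored with the compressed text. It is a minimal sketch, not the decoder used in the presentation.

```python
def inverse_bwt_with_suffix_array(last, primary):
    """Recover the text and the suffix array from a rotation-based BWT.

    last    -- the BWTed text (last column of the sorted rotations)
    primary -- row of the unrotated text in the sorted order (assumed stored)
    """
    n = len(last)
    # A stable sort of the last column yields the first column.
    # order[i] is the row whose rotation starts one position after row i's.
    order = sorted(range(n), key=lambda i: last[i])
    sa = [0] * n
    chars = []
    row = primary
    for pos in range(n):
        sa[row] = pos            # this row holds the rotation starting at pos
        row = order[row]         # move to the rotation starting at pos + 1
        chars.append(last[row])  # its last symbol is the symbol at pos
    return "".join(chars), sa


text, sa = inverse_bwt_with_suffix_array("cCABAb", 1)
print(text, sa)
```

For this input the sketch yields the text AbcABC and the suffix array 3 0 4 5 1 2, so no separate copy of the suffix array has to be transmitted.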

5. Our Contribution

We propose a Modified Burrows-Wheeler Transformation (MBWT).
- It is used for compressing a text and its suffix array.
- The decoded suffix array can be used for case-insensitive search.
- Any unification function can be used (not only case-insensitive search).
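How a suffix array sorted on unified text supports case-insensitive search can be sketched as follows. The function names `build_index` and `find_all` are hypothetical, `str.lower` stands in for the unification function, and the suffix array is built by naive sorting rather than decoded from compressed text, so this is an illustration of the idea only.

```python
def build_index(text, unify=str.lower):
    """Return the unified text and its suffix array (naive construction)."""
    unified = unify(text)
    sa = sorted(range(len(unified)), key=lambda i: unified[i:])
    return unified, sa


def find_all(unified, sa, pattern, unify=str.lower):
    """Return all start positions where the unified pattern occurs."""
    p = unify(pattern)
    m = len(p)

    # Binary search for the first suffix whose m-symbol prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if unified[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Binary search for the first suffix whose m-symbol prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if unified[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


unified, sa = build_index("AbcABC")
print(find_all(unified, sa, "BC"))
```

Searching AbcABC for BC reports positions 1 and 4, that is, both the lowercase and the uppercase occurrence.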

6. An Application: Distributed Web Search Robots

[Diagram: each search robot collects text from Web sites (examples shown: "Abc ABC", "xyz XYZ"), compresses the collected text by Block Sorting, and transfers it via the network.]

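7. Search Server

[Diagram: the search server receives the data transferred via the network, decodes the text and its suffix array, and merges them into a database. The suffix array is kept on disk. Sample suffix array entries shown: 3 10 8 5 2 7..., 14 2 8 3 9 5 10..., 8 4 100 251 58...]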

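8. The original BWT

Input text: AbcABC

Rotations of the input text, by start index:
0 AbcABC
1 bcABCA
2 cABCAb
3 ABCAbc
4 BCAbcA
5 CAbcAB

After sorting (the start indexes form the suffix array, the last column is the BWTed text):
3 ABCAb c
0 AbcAB C
4 BCAbc A
5 CAbcA B
1 bcABC A
2 cABCA b

Suffix array: 3 0 4 5 1 2
BWTed text: cCABAb
The reverse BWT recovers the input text from the BWTed text.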
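The example on this slide can be reproduced with a short sketch. It sorts the rotations naively, which is far slower than a real block-sorting compressor, and the function name `bwt` is chosen here for illustration.

```python
def bwt(text):
    """Return (suffix array, BWTed text) by naively sorting all rotations."""
    n = len(text)
    # Sort the start indexes by the rotation that begins at each index.
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The last symbol of the rotation starting at i is the symbol before i
    # (text[-1] for i == 0, because rotations wrap around).
    last = "".join(text[i - 1] for i in sa)
    return sa, last


sa, last = bwt("AbcABC")
print(sa, last)
```

With Python's default string ordering, capital letters sort before small letters, so AbcABC gives the suffix array 3 0 4 5 1 2 and the BWTed text cCABAb, as on the slide.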

9. Unification

- unify capital and small letters (tolower): DCC = dcc
- unify double-byte codes and single-byte codes in the Japanese EUC code: ＡＢＣ (a3c1 a3c2 a3c3) = ABC (41 42 43)
- unify Japanese Hiragana and Katakana: あいうえお = アイウエオ

We identify character equivalence.
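A unification function covering the three examples above might look like the sketch below. It works on Unicode strings rather than on EUC bytes as in the slide, uses NFKC normalization as a stand-in for the double-byte to single-byte mapping, and the names `unify` and `katakana_to_hiragana` are hypothetical. NFKC can change the number of symbols for some inputs, so a production unification function would need a strict symbol-to-symbol table.

```python
import unicodedata

KATAKANA_FIRST = 0x30A1    # KATAKANA LETTER SMALL A
KATAKANA_LAST = 0x30F6     # KATAKANA LETTER SMALL KE
KANA_OFFSET = 0x60         # distance from a Katakana letter to its Hiragana


def katakana_to_hiragana(ch):
    """Map one Katakana letter to Hiragana; leave other symbols unchanged."""
    code = ord(ch)
    if KATAKANA_FIRST <= code <= KATAKANA_LAST:
        return chr(code - KANA_OFFSET)
    return ch


def unify(text):
    """Fold fullwidth forms, letter case, and Katakana into one form."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(katakana_to_hiragana(ch) for ch in folded)


print(unify("DCC"), unify("ＡＢＣ"), unify("アイウエオ"))
```

Two strings are treated as equivalent exactly when `unify` maps them to the same string.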

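10. Modified BWT

The MBWT permutes the symbols of the input text by the suffix array of the unified text.

Input text: AbcABC
Unified text: abcabc$

Suffixes of the unified text, by start index:
0 abcabc$
1 bcabc$
2 cabc$
3 abc$
4 bc$
5 c$

After sorting (right column: the input-text symbol that precedes each suffix):
3 abc$ c
0 abcabc$ C
4 bc$ A
1 bcabc$ A
5 c$ B
2 cabc$ b

Suffix array: 3 0 4 1 5 2
MBWTed text: cCAABb
Unified first column: aabbcc. Unified last column: ccaabb.
The reverse MBWT recovers the input text from the MBWTed text.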
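The construction on this slide can be sketched in a few lines. The sketch assumes that the unification function maps each symbol to exactly one symbol, and it relies on Python ordering a proper prefix before any longer string, which plays the role of the terminator $ being smaller than every other symbol. The function name `mbwt` is chosen here for illustration.

```python
def mbwt(text, unify=str.lower):
    """Return (suffix array of the unified text, MBWTed text)."""
    unified = "".join(unify(ch) for ch in text)
    # Assumption: unification is symbol-to-symbol, so positions line up.
    assert len(unified) == len(text)
    # Sort the suffixes of the unified text. A shorter suffix that is a
    # prefix of a longer one sorts first, like "abc$" before "abcabc$".
    sa = sorted(range(len(text)), key=lambda i: unified[i:])
    # Emit the ORIGINAL symbol that precedes each suffix (wrapping at 0),
    # so the case information survives in the transformed text.
    out = "".join(text[i - 1] for i in sa)
    return sa, out


print(mbwt("AbcABC"))
print(mbwt("AbcABC", unify=lambda ch: ch))
```

For AbcABC this gives the suffix array 3 0 4 1 5 2 and the MBWTed text cCAABb. With the identity function as `unify`, the same input gives 3 0 4 5 1 2 and cCABAb, the result of the original BWT.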

11. Compression Ratio and Speed

HTML files (total 90 Mbytes), block size: 9 Mbytes

unification func.    comp. ratio    comp. time (s)
identical (BWT)      1.743          363.58
normal (MBWT)        1.764          363.41
LSB4                 2.523          443.89
MSB4                 2.707          438.04
zero (no BWT)        5.772          411.74

- There is only a small difference between BWT and MBWT.
- MBWT provides case-insensitive searches.

