Faster Approximate String Matching over Compressed Text By Gonzalo Navarro *, Takuya Kida †, Masayuki Takeda †, Ayumi Shinohara †, and Setsuo Arikawa.

2 Faster Approximate String Matching over Compressed Text
By Gonzalo Navarro *, Takuya Kida †, Masayuki Takeda †, Ayumi Shinohara †, and Setsuo Arikawa †
* Dept. of Computer Science, University of Chile
† Dept. of Informatics, Kyushu University

3 Contents Introduction –Motivation –Related works and our goal Our search approach on LZ78/LZW –Basic idea – Filtration technique –Multiple pattern matching algorithms on compressed text Experimental results Conclusion

4 Motivation Compressed pattern matching –Let sleeping files lie. –Reduce space, reduce searching time. [Figure: conventional approach. The compressed file on secondary disk storage is transferred to memory, decompressed in memory, and then searched in memory.]

5 Motivation Compressed pattern matching –Let sleeping files lie. –Reduce space, reduce searching time. [Figure: compressed pattern matching. The compressed file on secondary disk storage is transferred to memory and searched directly, without decompression.]

6 Related Works (1)
year, researcher: compression
–1988 Eilam-Tzoreff and Vishkin: run-length
–1992 Amir, Landau, and Vishkin: two-dimensional run-length
–1995 Farach and Thorup: LZ77
–1996 Amir, Benson, and Farach: LZW
–1997 Karpinski, Rytter, and Shinohara: straight-line programs
–1996 Gąsieniec, et al.: LZ77
–1997 Miyazaki, Shinohara, and Takeda: straight-line programs
–1992 Amir and Benson: two-dimensional run-length
–1994 Amir, Benson, and Farach: two-dimensional run-length
–1997 Takeda: finite state encoding
–1998 Shibata, et al.: byte pair encoding
–1994 Manber: original compression scheme
–1998 Miyazaki, et al.: Huffman encoding
–1998 Kida, et al.: LZ78/LZW
–1998 Moura, et al.: word based encoding

7 Related Works (2)
year, researcher: compression
–1999 Shibata, et al.: antidictionary based
–1999 Kida, et al.: LZ78/LZW
–2000 Shibata, et al.: collage systems
–1999 Navarro and Raffinot: LZ family, hybrid LZ
–1999 Kida, et al.: dictionary based methods (collage system)
–2000 Kärkkäinen, Navarro, and Ukkonen: LZ family
–2000 Matsumoto, et al.: simple collage systems
–2000 Navarro and Tarhio: LZ family
–1999 Gąsieniec and Rytter: LZW
–2000 Klein and Shapira: LZSS variant
–2001 Klein and Shapira: Huffman encoding

8 Approximate String Matching Edit distance ed(P, P’) –Minimum number of insertions, deletions and replacements needed to transform P into P’. Report all occurrences of any string P’ s.t. ed(P, P’) ≤ k for a given pattern P and error threshold k. Survey paper: G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 2000.
Example. Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC Pattern: TAAATCACGGCATACT k = 2
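The definition on this slide can be made concrete with a small sketch. The code below is the classical column-by-column dynamic programming search on uncompressed text, not the compressed-text method of the talk; the function name `approx_search` is hypothetical. It reports every 0-based text position where some substring ending there is within edit distance k of the pattern.

```python
def approx_search(text, pattern, k):
    """Return 0-based end positions j such that some substring of text
    ending at j has edit distance <= k to pattern (plain-text sketch)."""
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and a suffix of the
    # text read so far; a match may start anywhere, so col[0] is always 0.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        diag = col[0]          # value of the previous column, row i-1
        col[0] = 0
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(diag + cost,    # match or replacement
                         old + 1,        # extra character in the text
                         col[i - 1] + 1) # missing pattern character
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
pattern = "TAAATCACGGCATACT"
print(approx_search(text, pattern, 2))
```

On the slide's example, the text substring TAGATCACGGCACTACT differs from the pattern by one replacement and one extra character, so its end position is reported for k = 2.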

9 Previous Results J. Kärkkäinen, G. Navarro, and E. Ukkonen. Approximate string matching over Ziv-Lempel compressed text. In Proc. CPM2000. –Dynamic programming technique –O(mkn+R) worst case, O(k^2 n+R) average case T. Matsumoto, T. Kida, M. Takeda, A. Shinohara, and S. Arikawa. Bit-parallel approach to approximate string matching in compressed texts. In Proc. SPIRE2000. –Bit-parallel technique –O(mk^3 n/w) worst case

10 Our Search Approach on LZ78/LZW Introduction –Motivation –Related works and our goal Our search approach on LZ78/LZW –Basic idea –Multiple pattern matching algorithms on compressed text Experimental results Conclusion

11 Basic Idea Filtration technique (Wu and Manber, 1992) –Split the pattern into k+1 pieces of nearly equal length; with at most k errors, at least one piece must occur unchanged –Find the pattern pieces: multiple pattern matching –Direct verification of each candidate text area (we have chosen Myers’ algorithm)
Example. Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC Pattern: TAAATCACGGCATACT k = 2 Pattern pieces: TAAAT, CACGG, CATACT
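A minimal sketch of the filtration idea on uncompressed text, under stated assumptions: the piece search uses Python's `str.find` in place of the multiple pattern matching on compressed text, and the verification uses a simple dynamic programming check in place of Myers' algorithm. The names `split_pattern`, `verify`, and `filter_search` are hypothetical, and k is assumed to be smaller than the pattern length.

```python
def split_pattern(pattern, k):
    """Cut pattern into k+1 consecutive pieces; return (offset, piece) pairs."""
    m = len(pattern)
    cuts = [i * m // (k + 1) for i in range(k + 2)]
    return [(cuts[i], pattern[cuts[i]:cuts[i + 1]]) for i in range(k + 1)]


def verify(window, pattern, k):
    """Return end positions inside window of matches with <= k errors
    (plain dynamic programming, standing in for Myers' algorithm)."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(window):
        diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != c), old + 1, col[i - 1] + 1)
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def filter_search(text, pattern, k):
    """Find exact piece occurrences, then verify only the text area around each."""
    m, n = len(pattern), len(text)
    hits = set()
    for offset, piece in split_pattern(pattern, k):
        pos = text.find(piece)
        while pos != -1:
            # A match containing this piece starts at most k positions
            # before pos - offset and ends at most k positions after
            # pos - offset + m.
            lo = max(0, pos - offset - k)
            hi = min(n, pos - offset + m + k)
            hits.update(lo + e for e in verify(text[lo:hi], pattern, k))
            pos = text.find(piece, pos + 1)
    return sorted(hits)


text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
pattern = "TAAATCACGGCATACT"
print([piece for _, piece in split_pattern(pattern, 2)])
print(filter_search(text, pattern, 2))
```

In the slide's example only the piece CACGG occurs in the text, so verification is limited to the area around that single occurrence.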

12 Why LZ78/LZW? We have already developed a multiple pattern matching algorithm on LZW. Easy to decompress locally.
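To illustrate why local decompression is easy, here is a minimal LZ78 sketch (textbook LZ78 with (phrase index, character) pairs, not the authors' implementation; all function names are hypothetical). Each phrase refers to one earlier phrase plus one character, so a single phrase can be expanded by following parent references without decoding the rest of the text.

```python
def lz78_compress(text):
    """Parse text into LZ78 pairs (index of earlier phrase, next character)."""
    dictionary = {"": 0}
    pairs = []
    cur = ""
    for ch in text:
        if cur + ch in dictionary:
            cur += ch
        else:
            pairs.append((dictionary[cur], ch))
            dictionary[cur + ch] = len(dictionary)
            cur = ""
    if cur:
        # Flush a trailing phrase that is already in the dictionary.
        pairs.append((dictionary[cur[:-1]], cur[-1]))
    return pairs


def lz78_decompress(pairs):
    """Rebuild the whole text from the pairs."""
    phrases = [""]
    for index, ch in pairs:
        phrases.append(phrases[index] + ch)
    return "".join(phrases[1:])


def expand_phrase(pairs, i):
    """Expand only phrase i (1-based) by walking its parent references."""
    chars = []
    while i > 0:
        index, ch = pairs[i - 1]
        chars.append(ch)
        i = index
    return "".join(reversed(chars))


text = "CTTAATTAAGCCCCCTGCTAAGCT"
pairs = lz78_compress(text)
print(pairs)
print(expand_phrase(pairs, len(pairs)))
```

This local expansion is what the verification step of a filtration approach needs: only the phrases around a candidate area have to be decoded.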

13 Multiple Pattern Matching Algorithms on Compressed Text Aho-Corasick technique Boyer-Moore technique Bit parallel technique

14 Aho-Corasick Technique T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Multiple pattern matching in LZW compressed text. In Proc. DCC’98. Simulate the AC machine running over LZW directly. O(m^2+n+R) time, O(m^2+n) space

15 Aho-Corasick Technique Patterns: TTAA, AA [Figure: the AC machine for the patterns (goto and failure functions, states 0 to 6) is run over the compressed text blocks b1 ... b7 of the original text CTTAATTAAGCCCCCTGCTAAGCT; the state transitions per block and the pattern occurrences TTAA and AA are shown.]
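For reference, the AC machine that the algorithm simulates can be sketched as follows. This is the standard Aho-Corasick automaton run on the uncompressed text; the simulation over LZW blocks described in the cited paper is not shown. The names `build_ac` and `ac_search` are hypothetical.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output tables of an Aho-Corasick machine."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def ac_search(text, patterns):
    """Return (end position, pattern) for every occurrence, 0-based."""
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((i, p) for p in out[s])
    return hits


print(ac_search("CTTAATTAAGCCCCCTGCTAAGCT", ["TTAA", "AA"]))
```

For the two patterns of the slide the machine has seven states (the root plus T, TT, TTA, TTAA, A, AA), which matches the state numbers 0 to 6 in the figure.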

16 Boyer-Moore Technique G. Navarro and J. Tarhio, Boyer-Moore string matching over Ziv-Lempel compressed text. In Proc. CPM2000. Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa, A Boyer-Moore type algorithm for compressed pattern matching. In Proc. CPM2000. T. Kida et al., Multiple pattern matching algorithms on collage system. In Proc. CPM2001, to appear.

17 Boyer-Moore Technique 1. Find all occurrences that end in the focused block. 2. Calculate the maximum safe shift. 3. Move the focus according to that shift. [Figure: compressed text blocks b1 ... b7 over the original text CTTAATTAAGCCCCCTGCTAAGCT, with the pattern occurrences marked.]
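The notion of a safe shift can be illustrated on uncompressed text with the Horspool simplification of Boyer-Moore. This is a swapped-in textbook variant for a single pattern, not the block-level algorithm of the cited papers, and the name `horspool_search` is hypothetical. The shift is safe because it never skips a position where the pattern could still match.

```python
def horspool_search(text, pattern):
    """Return 0-based start positions of pattern in text (Horspool variant)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character of pattern[:-1], distance from its last occurrence
    # to the end of the pattern; characters not listed allow a shift of m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the amount allowed by the text character under the
        # last pattern position.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_search("CTTAATTAAGCCCCCTGCTAAGCT", "TTAA"))
```

In the compressed setting the same principle is applied to blocks rather than characters: the focus moves over whole blocks whenever the shift allows it.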

18 Bit Parallel Technique G. Navarro and M. Raffinot, A general practical approach to pattern matching over Ziv-Lempel compressed text. In Proc. CPM’99. T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Shift-And approach to pattern matching in LZW compressed text. In Proc. CPM’99.

19 Bit Parallel Technique
Pattern: TTAA
Focused phrase (block b_i of the compressed text, between b_{i-1} and b_{i+1}): AAGTTAACTTAAGCCGTT
Bit vectors:
(i) Pattern suffixes := 110000000000000000
(ii) Occurrences inside block b_i := 000000100001000000
(iii) Pattern prefixes := 000000000000000011
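The three bit vectors of this slide can be reproduced with a short sketch. It computes them by direct string comparison on the expanded phrase, which is only meant to show what each bit means; the cited algorithms obtain this information with bit-parallel operations on the compressed representation. The function name `phrase_bit_vectors` is hypothetical, and only proper suffixes and prefixes of the pattern are counted, which is an assumption.

```python
def phrase_bit_vectors(phrase, pattern):
    """Return the three bit strings describing how pattern meets phrase.

    (i)   bit j is 1 if phrase[:j+1] is a proper suffix of pattern
          (an occurrence started in an earlier block may end here)
    (ii)  bit j is 1 if an occurrence of pattern ends at position j
    (iii) bit j is 1 if phrase[j:] is a proper prefix of pattern
          (an occurrence may start here and continue in the next block)
    """
    n, m = len(phrase), len(pattern)
    suffixes = "".join(
        "1" if j + 1 < m and pattern.endswith(phrase[:j + 1]) else "0"
        for j in range(n))
    inside = "".join(
        "1" if j + 1 >= m and phrase[j + 1 - m:j + 1] == pattern else "0"
        for j in range(n))
    prefixes = "".join(
        "1" if n - j < m and pattern.startswith(phrase[j:]) else "0"
        for j in range(n))
    return suffixes, inside, prefixes


for bits in phrase_bit_vectors("AAGTTAACTTAAGCCGTT", "TTAA"):
    print(bits)
```

For the slide's phrase and pattern the three strings agree with the bit vectors (i), (ii) and (iii) listed above.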

20 Experimental Results Introduction –Motivation –Related works and our goal Our search approach on LZ78/LZW –Basic idea –Multiple pattern matching algorithms on compressed text Experimental results Conclusion

21 Experimental Results –550 MHz Intel Pentium III with 64 MB of RAM, running Linux –10 MB of Wall Street Journal (WSJ) articles and 10 MB of DNA data –WSJ was compressed to 42.59% of its original size and DNA to 27.71%

22 Experimental Results

24 Conclusion We applied the filtration technique to compressed texts. We implemented two new multiple pattern matching algorithms on compressed text. –Boyer-Moore type and Bit-parallel type. We showed that this is a practical solution for approximate pattern matching on compressed text. –10-30 times faster than previous solutions. –Up to 3 times faster than decompressing plus searching.

