Published by Alexis Flanagan. Modified over 3 years ago.

1
Suffix Array: Data structures and applications
Zhifeng Liu
School of Computer Science, Univ. of Waterloo
Dec. 06, 2004

2
Outline
- Introduction
- Suffix array and enhanced suffix array
- An example: is P a substring of S?
- Conclusions
- References

3
Introduction
Why suffix arrays?
Drawbacks of suffix trees:
- Space consumption: 20n bytes (n = |S|, the string length) [Kur99]
- Memory locality: poor locality of reference costs efficiency
Suffix array (PAT array): Manber & Myers [Man93]; also Gonnet & Baeza-Yates [Gon92]

4
Suffix array
Definition & an example
Informal definition: the same information as a suffix tree, but more compact: the suffixes of S in alphabetical order.
Example: the suffix array for banana# is:
  #
  a#
  ana#
  anana#
  banana#
  na#
  nana#
From Prof. Brown's Assignment 2 handout
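The banana# example can be reproduced with a few lines of Python. This is a minimal sketch, not an efficient construction: it sorts the suffixes directly, which is simple but slower than the construction methods discussed later. The function name `suffix_array` is an assumption made for illustration; the ordering relies on `#` sorting before the letters in ASCII.

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in sorted order.

    Naive sketch: each comparison may scan a whole suffix, so this is
    meant for small examples only.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


s = "banana#"
sa = suffix_array(s)
for i in sa:
    print(i, s[i:])
```

The array itself stores only the start positions; the suffixes are read off the original string when needed, which is where the space saving comes from.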

5
Suffix array isn't perfect either
- Less space: 4n bytes, but
- Direct construction time: O(n log n)
- Linear construction time via a suffix tree, but that sacrifices space
- Binary search for a substring P takes O(m log n) (m = |P|)
So: the enhanced suffix array!
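The O(m log n) binary search can be sketched as follows. The helper names `build_suffix_array` and `contains` are assumptions for illustration, and the construction is the naive sort, not one of the methods the slide refers to. Each of the O(log n) steps compares at most m characters of one suffix against P.

```python
def build_suffix_array(s):
    # Naive construction by sorting suffixes; fine for small examples.
    return sorted(range(len(s)), key=lambda i: s[i:])


def contains(s, sa, p):
    """Return True if p occurs in s, using binary search over sa."""
    m = len(p)
    lo, hi = 0, len(sa)
    # Find the first suffix whose m-character prefix is >= p.
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    # p occurs exactly when that suffix starts with p.
    return lo < len(sa) and s[sa[lo]:sa[lo] + m] == p


text = "banana#"
sa = build_suffix_array(text)
print(contains(text, sa, "nan"))   # pattern that occurs
print(contains(text, sa, "nab"))   # pattern that does not occur
```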

6
Enhanced Suffix Array
= suffix array + additional tables

   i   suftab   lcptab   S_suftab[i]
   0     2        0      aaacatat$
   1     3        2      aacatat$
   2     0        1      acaaacatat$
   3     4        3      acatat$
   4     6        1      atat$
   5     8        2      at$
   6     1        0      caaacatat$
   7     5        2      catat$
   8     7        0      tat$
   9     9        1      t$
  10    10        0      $

lcp-interval tree (l-[i..j] is an interval [i..j] with lcp value l):
  0-[0..10]
    1-[0..5]
      2-[0..1]
      3-[2..3]
      2-[4..5]
    2-[6..7]
    1-[8..9]

Fig. 1: The enhanced suffix array for S = acaaacatat$ and its lcp-interval tree. Adapted from [Abo04].
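The suftab and lcptab columns of Fig. 1 can be recomputed with a short sketch. Two assumptions are made here: the function name `build_tables` is hypothetical, and the sentinel `$` is ranked after every other character, which is what the ordering in Fig. 1 implies (in plain ASCII order it would sort first). Both tables are built naively, so this illustrates the definitions only.

```python
def build_tables(s, sentinel="$"):
    """Return (suftab, lcptab) for s, with the sentinel sorting last."""
    highest = "\U0010FFFF"  # largest code point, stands in for the sentinel

    def sort_key(i):
        return s[i:].replace(sentinel, highest)

    suftab = sorted(range(len(s)), key=sort_key)

    def lcp(a, b):
        # Length of the longest common prefix of strings a and b.
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k

    # lcptab[0] is 0 by convention; lcptab[k] compares neighbours k-1 and k.
    lcptab = [0] + [
        lcp(s[suftab[k - 1]:], s[suftab[k]:]) for k in range(1, len(s))
    ]
    return suftab, lcptab


suftab, lcptab = build_tables("acaaacatat$")
print(suftab)
print(lcptab)
```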

7
Enhanced Suffix Array (2)
[Figure: the suffix tree for S = acaaacatat$ shown next to its lcp-interval tree: 0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5]]
Fig. 2: The lcp-interval tree vs. the suffix tree for S = acaaacatat$

8
Enhanced Suffix Array (3): the more tables, the more like a suffix tree?
[Figure: the lcp-interval tree 0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5]]
Fig. 3: ChildTab records the links in the lcp-interval tree
ChildTab: up, down and next fields record the parent-child and sibling relationships.
The lcp-interval tree is like a suffix tree. It is virtual (never built explicitly), yet it can simulate suffix tree traversal efficiently.
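To make the parent-child relationship concrete, the sketch below derives the child intervals of an lcp-interval directly from lcptab. This is a stand-in, not the ChildTab itself: it scans the interval linearly, whereas the point of ChildTab is to answer the same question without scanning. The function name `child_intervals` is an assumption, and the lcptab values are copied from Fig. 1.

```python
# lcptab for S = acaaacatat$, copied from Fig. 1.
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def child_intervals(lcptab, i, j):
    """Return (l, children) for the lcp-interval [i..j] with i < j.

    l is the lcp value of the interval: the minimum of lcptab[i+1..j].
    The interval is split at every index where lcptab equals l; the
    resulting pieces are its children (singletons included).
    """
    l = min(lcptab[i + 1:j + 1])
    cuts = [k for k in range(i + 1, j + 1) if lcptab[k] == l]
    bounds = [i] + cuts + [j + 1]
    children = [(bounds[t], bounds[t + 1] - 1) for t in range(len(bounds) - 1)]
    return l, children


root = child_intervals(LCPTAB, 0, 10)
inner = child_intervals(LCPTAB, 0, 5)
print(root)
print(inner)
```

For the root this yields the three children drawn in the figures plus the singleton [10..10], which holds the suffix $.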

9
Enhanced suffix array replaces suffix tree
Every algorithm that uses a suffix tree can be systematically replaced by one on the (enhanced) suffix array with the same time complexity:
- Bottom-up traversal of the suffix tree -> suffix array with lcptab (lcp-interval tree)
- Top-down traversal of the suffix tree -> suffix array with childtab
Example: answering decision queries
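A bottom-up traversal needs nothing beyond lcptab and a stack. The sketch below enumerates all lcp-intervals in the order a bottom-up suffix tree traversal would visit the corresponding internal nodes. It follows the general stack-based scheme described in [Abo04], but the function name, the tuple layout and the end-of-table sentinel are assumptions made for this illustration.

```python
# lcptab for S = acaaacatat$, copied from Fig. 1.
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def lcp_intervals(lcptab):
    """Return all lcp-intervals as (lcp, left, right), children first."""
    n = len(lcptab)
    stack = [(0, 0)]          # open intervals as (lcp value, left boundary)
    found = []
    for i in range(1, n + 1):
        # A value of -1 past the end closes every interval still open.
        cur = lcptab[i] if i < n else -1
        left = i - 1
        while stack and cur < stack[-1][0]:
            lcp, left = stack.pop()
            found.append((lcp, left, i - 1))
        if not stack or cur > stack[-1][0]:
            # A new interval opens; it starts where the last closed one did.
            stack.append((cur, left))
    return found


intervals = lcp_intervals(LCPTAB)
for lcp, left, right in intervals:
    print(f"{lcp}-[{left}..{right}]")
```

The seven intervals produced are the seven nodes of the lcp-interval tree in Fig. 1, with the root reported last.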

10
Answer Decision Queries
Algorithm: Answering decision queries
  c := 0
  queryFound := true
  (i, j) := getInterval(0, n, P[c])
  while (i, j) <> ... and c ...
  [the rest of the listing is cut off in the source]

11
Answer Decision Queries (cont'd)
S_suftab[i]:
  aaacatat$
  aacatat$
  acaaacatat$
  acatat$
  atat$
  at$
  caaacatat$
  catat$
  tat$
  t$
  $
Query patterns: P = cb, P = caaa
Longest common string
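The two queries on this slide can be traced with the following sketch of a top-down decision query. It is a reconstruction under stated assumptions, not the listing from the slides: `get_interval` here finds the matching child by scanning lcptab, where the enhanced suffix array would use childtab, and all function names are hypothetical. The tables are copied from Fig. 1, and the text is assumed to end in a unique sentinel.

```python
S = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def get_interval(s, suftab, lcptab, i, j, ch):
    """Return the child of lcp-interval [i..j] (i < j) whose suffixes
    continue with character ch, or None if there is no such child."""
    l = min(lcptab[i + 1:j + 1])                 # lcp value of [i..j]
    cuts = [k for k in range(i + 1, j + 1) if lcptab[k] == l]
    bounds = [i] + cuts + [j + 1]
    for t in range(len(bounds) - 1):
        a, b = bounds[t], bounds[t + 1] - 1
        pos = suftab[a] + l
        if pos < len(s) and s[pos] == ch:
            return (a, b)
    return None


def is_substring(s, suftab, lcptab, p):
    """Decide whether p occurs in s by walking the lcp-interval tree."""
    m = len(p)
    if m == 0:
        return True
    c = 0                                        # characters of p matched
    interval = get_interval(s, suftab, lcptab, 0, len(s) - 1, p[0])
    while interval is not None and c < m:
        i, j = interval
        start = suftab[i]
        if i == j:
            # Single suffix left: compare the rest of p against it.
            return s[start + c:start + m] == p[c:]
        l = min(lcptab[i + 1:j + 1])
        upto = min(l, m)
        # All suffixes in [i..j] share their first l characters.
        if s[start + c:start + upto] != p[c:upto]:
            return False
        c = upto
        if c < m:
            interval = get_interval(s, suftab, lcptab, i, j, p[c])
    return interval is not None


print(is_substring(S, SUFTAB, LCPTAB, "caaa"))
print(is_substring(S, SUFTAB, LCPTAB, "cb"))
```

For P = caaa the walk goes root -> 2-[6..7] -> the single suffix caaacatat$ and succeeds; for P = cb it reaches 2-[6..7], compares "ca" with "cb" and stops.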

12
Additional tables eat too much space?
There are tricks to reduce the space requirements.
- If the string length n = |S| < 2^32, each integer index needs 4 bytes.
- suftab needs 4n bytes; does lcptab also need 4n? No!
- Usually only a few entries in lcptab are larger than 255, so store each lcptab entry in 1 byte and keep a separate table for the long lcp values.
- Space is saved and practical time efficiency is preserved, though the worst-case time complexity may be affected.
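The one-byte trick can be sketched as a small wrapper class. The details are assumptions made for illustration: the class name `CompactLcp` is hypothetical, the byte value 255 is used as an escape marker (so values of 255 and above go to the overflow table), and the overflow table is a sorted index list searched with `bisect`. That binary search is one way the worst-case lookup time can degrade while typical lookups stay at one byte read.

```python
from bisect import bisect_left


class CompactLcp:
    """lcptab stored in one byte per entry plus an overflow table."""

    ESCAPE = 255  # marker byte: the real value lives in the overflow table

    def __init__(self, lcptab):
        # One byte per entry; large values are clamped to the marker.
        self.small = bytearray(min(v, self.ESCAPE) for v in lcptab)
        # Overflow table: indices (ascending) and their full values.
        self.big_index = [i for i, v in enumerate(lcptab) if v >= self.ESCAPE]
        self.big_value = [lcptab[i] for i in self.big_index]

    def __len__(self):
        return len(self.small)

    def __getitem__(self, i):
        v = self.small[i]
        if v < self.ESCAPE:
            return v                      # common case: a single byte read
        k = bisect_left(self.big_index, i)
        return self.big_value[k]          # rare case: binary search


values = [0, 3, 254, 255, 1000, 7]
compact = CompactLcp(values)
print([compact[i] for i in range(len(compact))])
print(len(compact.big_index), "entries in the overflow table")
```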

13
Conclusions
- Suffix arrays: there is always a tension between space and speed, and research tries to relieve that tension.
- The suffix array can replace the suffix tree.
- The suffix array is practical: faster and easier to implement.

14
References
[Abo04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms, Volume 2, Issue 1 (March 2004)
[Abo02A] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, The Enhanced Suffix Array and Its Applications to Genome Analysis, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, September 2002
[Abo02B] Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, Stefan Kurtz, Optimal Exact String Matching Based on Suffix Arrays, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, p. 31-43, September 11-13, 2002
[Gon92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider, New indices for text: PAT Trees and PAT arrays, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992
[Kur99] S. Kurtz, Reducing the space requirement of suffix trees, Software: Practice and Experience 29 (13) (1999) 1149–1171
[Man93] Udi Manber, Gene Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, v.22 n.5, Oct. 1993
