Suffix Array: Data structures and applications Zhifeng Liu School of Computer Science, Univ. of Waterloo Dec. 06, 2004.


1 Suffix Array: Data structures and applications Zhifeng Liu School of Computer Science, Univ. of Waterloo Dec. 06, 2004

2 Outline: Introduction; Suffix array and enhanced suffix array; An example: is P a substring of S?; Conclusions; References

3 Introduction. Why suffix arrays? Suffix tree drawbacks: (1) space consumption: 20n bytes, where n=|S| is the string length [Kur99]; (2) poor memory locality, which costs efficiency. The suffix array (PAT array) is due to Manber & Myers [Man93] and also Gonnet & Baeza-Yates [Gon92].

4 Suffix array: definition and an example. Informal definition: the suffix array holds the same information as a suffix tree, but more compactly. It lists the suffixes of S in alphabetical order. Example: the suffix array for banana# is: #, a#, ana#, anana#, banana#, na#, nana# (here # sorts before every letter). From Prof. Brown's Assignment 2 handout.
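The banana# example can be reproduced with a few lines of Python. This is a minimal sketch, not the construction the slides have in mind: the function name `suffix_array` is made up for illustration, and sorting whole suffixes is far slower than the algorithms cited in the deck. It relies on `#` sorting before the letters, which holds for ASCII.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive approach: sort the positions by the suffix that starts there.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana#"
sa = suffix_array(text)
suffixes = [text[i:] for i in sa]
print(sa)
print(suffixes)
```

The array stores only the integer start positions; the suffix strings are printed here just to compare against the slide.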

5 Suffix array isn't perfect either. Less space: 4n, but direct construction time is O(n log n). Linear-time construction is possible via the suffix tree, but it sacrifices space. Binary search for a substring P takes O(m log n) (m=|P|). So: the enhanced suffix array!
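The O(m log n) bound comes from doing O(log n) binary-search steps, each comparing at most m characters. A minimal sketch, assuming a naively built suffix array and hypothetical function names (`suffix_array`, `contains`):

```python
def suffix_array(text):
    """Naive suffix array: sort start positions by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, sa, pattern):
    """Binary search the suffix array for a suffix that starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first m characters of the middle suffix.
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first suffix whose m-character prefix is >= pattern.
    return lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern


text = "banana#"
sa = suffix_array(text)
print(contains(text, sa, "nan"), contains(text, sa, "nab"))
```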

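6 Enhanced Suffix Array = suffix array + additional tables

 i   suftab  lcptab  S_suftab[i]
 0     2       0     aaacatat$
 1     3       2     aacatat$
 2     0       1     acaaacatat$
 3     4       3     acatat$
 4     6       1     atat$
 5     8       2     at$
 6     1       0     caaacatat$
 7     5       2     catat$
 8     7       0     tat$
 9     9       1     t$
10    10       0     $

Fig. 1: The enhanced suffix array for S=acaaacatat$ and its lcp-interval tree (in this table $ sorts after every letter). Adapted from [Abo04].
lcp-interval tree, written as lcp-[left,right]: root 0-[0,10]; its children 1-[0,5], 2-[6,7], 1-[8,9]; children of 1-[0,5]: 2-[0..1], 3-[2..3], 2-[4..5].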
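The suftab and lcptab columns of Fig. 1 can be recomputed by brute force. This is a minimal sketch under stated assumptions: the function name `enhanced_tables` is hypothetical, the method is quadratic rather than the construction used in [Abo04], and the sentinel `$` is ranked after every other character because that is the order the table uses.

```python
def enhanced_tables(s, sentinel="$"):
    """Return (suftab, lcptab) for s, with the sentinel sorting last."""
    def sort_key(start):
        # Map each character to a pair so the sentinel outranks everything else.
        return [(1, 0) if ch == sentinel else (0, ord(ch)) for ch in s[start:]]

    suftab = sorted(range(len(s)), key=sort_key)

    # lcptab[r] = length of the longest common prefix of the suffixes
    # at ranks r-1 and r; lcptab[0] is 0 by convention.
    lcptab = [0] * len(s)
    for r in range(1, len(s)):
        a, b = s[suftab[r - 1]:], s[suftab[r]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcptab[r] = k
    return suftab, lcptab


suftab, lcptab = enhanced_tables("acaaacatat$")
print(suftab)
print(lcptab)
```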

7 Enhanced Suffix Array (2). Fig. 2: The lcp-interval tree vs. the suffix tree for S=acaaacatat$ (figure not reproduced). lcp-interval tree: 0-[0,10]; 1-[0,5], 2-[6,7], 1-[8,9]; 2-[0..1], 3-[2..3], 2-[4..5].

8 Enhanced Suffix Array (3): the more tables, the more like a suffix tree? Fig. 3: childtab records the links of the lcp-interval tree 0-[0,10]; 1-[0,5], 2-[6,7], 1-[8,9]; 2-[0..1], 3-[2..3], 2-[4..5] (figure not reproduced). childtab: up, down and next fields record the parent-child and sibling relationships. The lcp-interval tree is like a suffix tree. It is virtual, that is, never built explicitly, yet it can simulate suffix tree traversal efficiently.

9 Enhanced suffix array replaces suffix tree. Every algorithm that uses a suffix tree can be systematically replaced by one that uses the (enhanced) suffix array, with the same time complexity. Bottom-up traversal of the suffix tree -> suffix array with lcptab and the lcp-interval tree. Top-down traversal of the suffix tree -> suffix array with childtab. Example: answering decision queries.
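The bottom-up traversal can be illustrated by enumerating all lcp-intervals in one left-to-right pass over lcptab with a stack, so that children are always reported before their parent. This is a sketch in the spirit of the stack-based traversal in [Abo04], not a transcription of it; the function name `lcp_intervals` is hypothetical and the lcptab values are copied from Fig. 1.

```python
def lcp_intervals(lcptab):
    """Return all lcp-intervals as (lcp, left, right), children before parents."""
    n = len(lcptab)
    found = []
    stack = [(0, 0)]  # (lcp value, left boundary); the root interval has lcp 0
    for i in range(1, n):
        left = i - 1
        # A smaller lcp value closes every open interval with a larger one.
        while lcptab[i] < stack[-1][0]:
            lcp, left = stack.pop()
            found.append((lcp, left, i - 1))
        # A larger lcp value opens a new interval starting at `left`.
        if lcptab[i] > stack[-1][0]:
            stack.append((lcptab[i], left))
    # Whatever is still open (at least the root) ends at the last index.
    while stack:
        lcp, left = stack.pop()
        found.append((lcp, left, n - 1))
    return found


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # from Fig. 1, S=acaaacatat$
intervals = lcp_intervals(lcptab)
for lcp, left, right in intervals:
    print(f"{lcp}-[{left},{right}]")
```

The intervals produced for this input are exactly the seven nodes of the lcp-interval tree shown on the slides, with the root 0-[0,10] reported last.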

10 Answer Decision Queries

Algorithm: Answering decision queries (is P, with m=|P|, a substring of S?)
  c := 0
  queryFound := true
  (i, j) := getInterval(0, n, P[c])
  while (i, j) ≠ ⊥ and c < m and queryFound do
    if i ≠ j then
      l := getlcp(i, j)
      min := min{l, m}
      queryFound := S[suftab[i]+c .. suftab[i]+min-1] = P[c .. min-1]
      c := min
      (i, j) := getInterval(i, j, P[c])
    else
      queryFound := S[suftab[i]+c .. suftab[i]+m-1] = P[c .. m-1]
  if queryFound then
    report [i, j] as an occurrence of P
  else
    print("P is not found in S")
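A runnable sketch of the decision-query idea, under several assumptions. The tables are hard-coded from Fig. 1. The helpers `get_lcp`, `child_intervals` and `get_interval` are hypothetical stand-ins that scan lcptab, whereas the enhanced suffix array answers the same questions in constant time through childtab. The sketch also adds explicit guards that the compact slide pseudocode leaves implicit: it stops once the whole pattern is matched, and it treats a missing child interval as "not found".

```python
S = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def get_lcp(i, j):
    """lcp value of the lcp-interval [i, j] with i < j."""
    return min(LCPTAB[i + 1:j + 1])


def child_intervals(i, j):
    """Split [i, j] at every index whose lcptab entry equals the interval's lcp."""
    lcp = get_lcp(i, j)
    cuts = [i] + [k for k in range(i + 1, j + 1) if LCPTAB[k] == lcp] + [j + 1]
    return [(cuts[t], cuts[t + 1] - 1) for t in range(len(cuts) - 1)]


def get_interval(i, j, ch):
    """Child interval of [i, j] whose suffixes continue with ch, or None."""
    if i == j:
        return None
    lcp = get_lcp(i, j)
    for a, b in child_intervals(i, j):
        if S[SUFTAB[a] + lcp] == ch:
            return (a, b)
    return None


def find_pattern(p):
    """Return the interval of suffixes starting with p, or None if p is absent."""
    m = len(p)
    if m == 0:
        return (0, len(S) - 1)
    c = 0  # number of pattern characters matched so far
    interval = get_interval(0, len(S) - 1, p[0])
    while interval is not None and c < m:
        i, j = interval
        start = SUFTAB[i]
        if i != j:
            stop = min(get_lcp(i, j), m)
            if S[start + c:start + stop] != p[c:stop]:
                return None
            c = stop
            if c < m:
                interval = get_interval(i, j, p[c])
        else:
            # Singleton interval: compare the rest of the pattern directly.
            if S[start + c:start + m] != p[c:m]:
                return None
            c = m
    return interval


print(find_pattern("caaa"), find_pattern("cb"))
```

For the two queries used in the slides, P=caaa is found at the single suffix with index 6 (start position 1 in S), and P=cb fails at the comparison against the common prefix "ca" of the interval [6,7].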

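11 Answer Decision Queries (cont'd). Figure (not reproduced): the suffixes S_suftab[i] of S=acaaacatat$ in suftab order (aaacatat$, aacatat$, acaaacatat$, acatat$, atat$, at$, caaacatat$, catat$, tat$, t$, $), annotated with the two queries P=cb and P=caaa and the label "Longest common string".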

12 Additional tables eat too much space? There are tricks to reduce the space requirements. If the string length n=|S| < 2^32, each integer index needs 4 bytes, so suftab needs 4n bytes. Does lcptab also need 4n? No! Usually only a few entries in lcptab exceed 255. So store each lcptab entry in 1 byte and allocate another table for the long lcp values. Space is saved and practical time efficiency is preserved, though the worst-case time complexity may be affected.
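The one-byte trick can be sketched as follows. The class name `CompactLcpTable` and its layout are assumptions for illustration: 255 is used as a marker byte, so values of 255 or more go to the overflow table, which is kept sorted by index and searched with binary search. That search is where the worst-case time can get worse.

```python
from bisect import bisect_left


class CompactLcpTable:
    """lcptab stored in one byte per entry, with an overflow table for large values."""

    MARK = 255  # byte value meaning "look in the overflow table"

    def __init__(self, lcptab):
        self.small = bytearray(min(v, self.MARK) for v in lcptab)
        # Overflow entries, sorted by index because enumerate() is ascending.
        self.big_index = [i for i, v in enumerate(lcptab) if v >= self.MARK]
        self.big_value = [lcptab[i] for i in self.big_index]

    def __len__(self):
        return len(self.small)

    def __getitem__(self, i):
        v = self.small[i]
        if v < self.MARK:
            return v  # common case: one byte read
        # Rare case: binary search among the overflow entries.
        return self.big_value[bisect_left(self.big_index, i)]


original = [0, 3, 254, 255, 1000, 7, 70000]
compact = CompactLcpTable(original)
restored = [compact[i] for i in range(len(compact))]
print(restored)
```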

13 Conclusions. Suffix arrays: there is always a tension between space and speed, and research tries to relieve that tension. Suffix arrays can replace suffix trees. Suffix arrays are practical: faster and easier to implement.

14 References
[Abo04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms, Volume 2, Issue 1, March 2004.
[Abo02A] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, The Enhanced Suffix Array and Its Applications to Genome Analysis, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, September 2002.
[Abo02B] Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, Stefan Kurtz, Optimal Exact String Matching Based on Suffix Arrays, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, p. 31-43, September 11-13, 2002.
[Gon92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider, New indices for text: PAT Trees and PAT arrays, in Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992.
[Kur99] S. Kurtz, Reducing the space requirement of suffix trees, Software: Practice and Experience 29 (13) (1999) 1149–1171.
[Man93] Udi Manber, Gene Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, v.22 n.5, Oct. 1993.

