Suffix Array: Data structures and applications Zhifeng Liu School of Computer Science, Univ. of Waterloo Dec. 06, 2004

Outline
Introduction
Suffix array and enhanced suffix array
An example: is P a substring of S?
Conclusions
References

Introduction
Why suffix arrays?
Suffix tree drawbacks:
  Space consumption: 20n bytes (n = |S|, the string length) [Kur99]
  Memory locality: poor locality of reference reduces efficiency
Suffix array (PAT array): Manber & Myers [Man93]; also Gonnet & Baeza-Yates [Gon92]

Suffix array
Definition & an example
Informal definition: the same information as a suffix tree, but more compact. It lists the suffixes of S in alphabetical order.
Example: the suffix array for banana# is:
  #
  a#
  ana#
  anana#
  banana#
  na#
  nana#
From Prof. Brown's Assignment 2 handout
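The banana# example can be reproduced with a minimal Python sketch. It sorts the suffix start positions naively, which is much slower than the construction algorithms cited in this talk, and the function name build_suffix_array is an illustrative assumption. It also relies on '#' sorting before the letters, which holds in ASCII.

```python
def build_suffix_array(s: str) -> list[int]:
    """Return the start positions of the suffixes of s in sorted order."""
    # Naive construction: sort positions by the suffix text they start.
    # Each comparison may scan O(n) characters, so this is only for illustration.
    return sorted(range(len(s)), key=lambda i: s[i:])


text = "banana#"
suffix_array = build_suffix_array(text)
for pos in suffix_array:
    # Print each start position next to the suffix it denotes.
    print(pos, text[pos:])
```

The array stores only the integer positions; the suffixes themselves are read from the text on demand, which is where the space saving comes from.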

Suffix array isn't perfect either
Less space: 4n bytes, but
  Direct construction time: O(n log n)
  Linear construction time via a suffix tree, but this sacrifices space
  Binary search for a substring P takes O(m log n) (m = |P|)
So: the enhanced suffix array!
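The O(m log n) bound comes from a binary search over the sorted suffixes in which each step compares at most m characters. A minimal sketch, assuming a naively built suffix array; the names build_suffix_array and contains are illustrative, not from the cited papers:

```python
def build_suffix_array(s: str) -> list[int]:
    # Naive construction, used here only to obtain a suffix array to search.
    return sorted(range(len(s)), key=lambda i: s[i:])


def contains(s: str, sa: list[int], p: str) -> bool:
    """Return True if p occurs in s, using binary search over the suffix array."""
    m = len(p)
    lo, hi = 0, len(sa)
    # Find the first suffix whose first m characters are >= p.
    # About log n iterations, each comparing at most m characters.
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    # p occurs in s exactly when that suffix starts with p.
    return lo < len(sa) and s[sa[lo]:sa[lo] + m] == p


text = "banana#"
sa = build_suffix_array(text)
print(contains(text, sa, "nan"), contains(text, sa, "nab"))
```

The search is valid because truncating every sorted suffix to its first m characters keeps the sequence in non-decreasing order.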

Enhanced Suffix Array
= suffix array + additional tables

  i   suftab  lcptab  S_suftab[i]
  0     2       0     aaacatat$
  1     3       2     aacatat$
  2     0       1     acaaacatat$
  3     4       3     acatat$
  4     6       1     atat$
  5     8       2     at$
  6     1       0     caaacatat$
  7     5       2     catat$
  8     7       0     tat$
  9     9       1     t$
 10    10       0     $

Fig. 1: The enhanced suffix array for S = acaaacatat$ and its lcp-interval tree. Adapted from [Abo04].

lcp-interval tree (each node written as lcp-value-[left..right]):
  0-[0..10]
    1-[0..5]
      2-[0..1]
      3-[2..3]
      2-[4..5]
    2-[6..7]
    1-[8..9]
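The suftab and lcptab columns of Fig. 1 can be recomputed with a short sketch. It assumes, as the figure's ordering implies, that the sentinel '$' sorts after every letter, so the sort key remaps it to a high character. The helper names suffix_key and build_tables are illustrative, and the quadratic lcp computation is chosen for clarity, not speed.

```python
def suffix_key(s: str, i: int) -> str:
    # Assumption: '$' must compare greater than all letters (as in Fig. 1),
    # so replace it with a character that sorts after 'z'.
    return s[i:].replace("$", "\x7f")


def build_tables(s: str) -> tuple[list[int], list[int]]:
    """Return (suftab, lcptab) for s; lcptab[0] is defined as 0."""
    suftab = sorted(range(len(s)), key=lambda i: suffix_key(s, i))
    lcptab = [0] * len(s)
    for k in range(1, len(s)):
        a, b = s[suftab[k - 1]:], s[suftab[k]:]
        # Length of the longest common prefix of two adjacent suffixes.
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        lcptab[k] = length
    return suftab, lcptab


S = "acaaacatat$"
suftab, lcptab = build_tables(S)
for i, (pos, lcp) in enumerate(zip(suftab, lcptab)):
    print(i, pos, lcp, S[pos:])
```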

Enhanced Suffix Array (2)
Fig. 2: The lcp-interval tree vs. the suffix tree for S = acaaacatat$
lcp-interval tree: root 0-[0..10]; its children 1-[0..5], 2-[6..7], 1-[8..9]; children of 1-[0..5]: 2-[0..1], 3-[2..3], 2-[4..5]

Enhanced Suffix Array (3): the more tables, the more like a suffix tree?
Fig. 3: childtab records the links in the lcp-interval tree
lcp-interval tree: root 0-[0..10]; its children 1-[0..5], 2-[6..7], 1-[8..9]; children of 1-[0..5]: 2-[0..1], 3-[2..3], 2-[4..5]
childtab: up, down, and next fields record the parent-child and sibling relationships.
The lcp-interval tree is like a suffix tree. It is virtual (never built explicitly), yet it can simulate suffix tree traversal efficiently.
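To see how the tree structure is implicit in lcptab, the sketch below derives the child intervals of an lcp-interval by scanning for the positions that hold the interval's minimum lcp value. This linear scan is a stand-in for the childtab lookup, which answers the same question in constant time; the function names are illustrative assumptions, and the lcptab values are copied from Fig. 1.

```python
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # lcptab for S = acaaacatat$


def child_intervals(lcptab, i, j):
    """Return (lcp value, child intervals) of the lcp-interval [i..j], i < j."""
    value = min(lcptab[i + 1:j + 1])
    # Positions holding the minimum split [i..j] into its child intervals.
    cuts = [k for k in range(i + 1, j + 1) if lcptab[k] == value]
    bounds = [i] + cuts + [j + 1]
    return value, [(bounds[t], bounds[t + 1] - 1) for t in range(len(bounds) - 1)]


def lcp_tree(lcptab, i, j, out):
    """Collect the internal nodes (lcp value, i, j) below [i..j] in preorder."""
    if i == j:
        return  # singleton intervals are leaves, i.e. single suffixes
    value, kids = child_intervals(lcptab, i, j)
    out.append((value, i, j))
    for a, b in kids:
        lcp_tree(lcptab, a, b, out)


nodes = []
lcp_tree(LCPTAB, 0, len(LCPTAB) - 1, nodes)
for value, i, j in nodes:
    print(f"{value}-[{i}..{j}]")
```

The printed nodes are the intervals of the lcp-interval tree for S = acaaacatat$; no explicit tree is ever stored.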

Enhanced suffix array replaces suffix tree
Every algorithm that uses a suffix tree can be systematically replaced by one that uses an (enhanced) suffix array, with the same time complexity.
Bottom-up traversal of the suffix tree -> suffix array with lcptab and the lcp-interval tree
Top-down traversal of the suffix tree -> suffix array with childtab
  Example: answering decision queries

Answer Decision Queries
Algorithm: Answering decision queries
  c := 0
  queryFound := true
  (i, j) := getInterval(0, n, P[c])
  while (i, j) <> ⊥ and c ...
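A top-down decision query can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the algorithm from the slide or from [Abo04]: get_interval and get_lcp scan the tables linearly instead of using childtab, and the loop is structured for readability. The tables are hard-coded from Fig. 1.

```python
S = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def get_lcp(i, j):
    # lcp value of the interval [i..j], i < j (linear scan; an assumption).
    return min(LCPTAB[i + 1:j + 1])


def get_interval(i, j, depth, ch):
    """Child interval of [i..j] whose suffixes have ch at offset depth, or None."""
    ks = [k for k in range(i, j + 1)
          if SUFTAB[k] + depth < len(S) and S[SUFTAB[k] + depth] == ch]
    return (ks[0], ks[-1]) if ks else None


def is_substring(p):
    m = len(p)
    i, j = 0, len(S) - 1  # root interval: all suffixes
    c = 0                 # number of pattern characters matched so far
    while c < m:
        child = get_interval(i, j, c, p[c])
        if child is None:
            return False
        i, j = child
        # All suffixes in [i..j] agree on their first `depth` characters.
        depth = len(S) - SUFTAB[i] if i == j else get_lcp(i, j)
        end = min(depth, m)
        # Compare the shared characters against the pattern in one step.
        if S[SUFTAB[i] + c:SUFTAB[i] + end] != p[c:end]:
            return False
        c = end
    return True


print(is_substring("caaa"), is_substring("cb"))
```

Each iteration descends one level in the virtual lcp-interval tree, just as a suffix tree search follows one edge.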

Answer Decision Queries (cont'd)
S_suftab[i]:
  aaacatat$
  aacatat$
  acaaacatat$
  acatat$
  atat$
  at$
  caaacatat$
  catat$
  tat$
  t$
  $
Example queries: P = cb and P = caaa
Figure label: longest common string

Additional tables eat too much space?
There are tricks to reduce the space requirements.
If the string length n = |S| < 2^32, each integer index needs 4 bytes.
suftab needs 4n bytes; does lcptab also need 4n bytes? No!
Usually only a few entries in lcptab are larger than 255, so store each lcptab entry in 1 byte and allocate another table for the long lcp values.
Space is saved and practical time efficiency is preserved, though the worst-case time complexity may be affected.
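The one-byte trick might look like the sketch below. It assumes that the byte value 255 is reserved as a marker meaning "look in the overflow table", and that the overflow table is a sorted list of positions searched by binary search; both choices, and the function names, are illustrative assumptions.

```python
from bisect import bisect_left


def compress_lcptab(lcptab):
    """Split lcptab into a 1-byte-per-entry table plus a table of long values."""
    small = bytearray(len(lcptab))
    big_idx, big_val = [], []  # positions (ascending) and values of long entries
    for k, value in enumerate(lcptab):
        if value < 255:
            small[k] = value
        else:
            small[k] = 255  # marker: the real value lives in the overflow table
            big_idx.append(k)
            big_val.append(value)
    return small, big_idx, big_val


def lcp_at(small, big_idx, big_val, k):
    """Return the original lcptab[k]."""
    if small[k] < 255:
        return small[k]  # common case: one byte read
    # Rare case: binary search in the overflow table.
    return big_val[bisect_left(big_idx, k)]


original = [0, 3, 255, 1000, 254]
small, big_idx, big_val = compress_lcptab(original)
restored = [lcp_at(small, big_idx, big_val, k) for k in range(len(original))]
print(restored)
```

The binary search in the rare case is why the worst-case time complexity may be affected while typical lookups stay fast.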

Conclusions
Suffix array: there is always a tension between space and speed, and research tries to relieve that tension.
The suffix array can replace the suffix tree.
The suffix array is practical: faster and easier to implement.

References
[Abo04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms, Volume 2, Issue 1 (March 2004).
[Abo02A] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, The Enhanced Suffix Array and Its Applications to Genome Analysis, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, September 2002.
[Abo02B] Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, Stefan Kurtz, Optimal Exact String Matching Based on Suffix Arrays, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, p. 31-43, September 11-13, 2002.
[Gon92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider, New indices for text: PAT trees and PAT arrays, Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992.
[Kur99] S. Kurtz, Reducing the space requirement of suffix trees, Software: Practice and Experience 29 (13) (1999) 1149–1171.
[Man93] Udi Manber, Gene Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, v.22 n.5, Oct. 1993.

