# Suffix Array: Data structures and applications


Suffix Array: Data structures and applications
Zhifeng Liu School of Computer Science, Univ. of Waterloo Dec. 06, 2004

Outline
- Introduction
- Suffix array and enhanced suffix array
- An example: Is P a substring of S?
- Conclusions
- References

Introduction: Why suffix arrays?
- Suffix tree’s drawbacks:
  - Space consumption: 20n bytes (n=|S|, the string length) [Kur99]
  - Poor memory locality: loss of efficiency
- Suffix array (PAT array): Manber & Myers [Man93], also Gonnet & Baeza-Yates [Gon92]

Suffix array: Definition & an example
- Informal definition: the same information as a suffix tree, but more compact: the suffixes of S in alphabetical order.
- Example: the suffix array for banana# lists the suffixes in this order: #, a#, ana#, anana#, banana#, na#, nana#

From Prof. Brown’s Assignment 2 handout
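The definition can be tried out directly. The following is a minimal sketch, not an efficient construction: it sorts the suffix start positions by comparing whole suffixes, which is fine for short strings. The function name `suffix_array` is chosen here for illustration, and the sketch relies on `#` sorting before the letters in ASCII.

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in sorted order.

    Naive construction: every comparison may read a whole suffix,
    so this is only meant for small examples.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


s = "banana#"
sa = suffix_array(s)
for rank, start in enumerate(sa):
    # rank in the array, start position in s, and the suffix itself
    print(rank, start, s[start:])
```

The array itself stores only the start positions; the suffixes are read from S when needed, which is where the space saving over a suffix tree comes from.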

Suffix array isn’t perfect either
- Less space: 4n bytes, but direct construction takes O(n log n) time.
- Linear-time construction is possible via a suffix tree, but it sacrifices space.
- Binary search for a substring P takes O(m log n) time (m=|P|).
- Hence the enhanced suffix array!
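The O(m log n) bound comes from O(log n) binary-search steps that each compare up to m characters. A minimal sketch of that search, assuming a plain suffix array built by naive sorting (the helper names `suffix_array` and `find_occurrences` are illustrative, not from the slides):

```python
def suffix_array(s):
    # Naive construction, good enough for a small demonstration.
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, p):
    """Return the sorted start positions of p in s using two binary searches."""
    m = len(p)

    # Lower bound: first suffix whose length-m prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


s = "banana#"
sa = suffix_array(s)
print(find_occurrences(s, sa, "ana"))
```

All suffixes that start with P are adjacent in the array, so the two bounds delimit every occurrence at once.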

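Enhanced Suffix Array = suffix array + additional tables

| i | suftab | lcptab | S_suftab[i] |
|---|---|---|---|
| 0 | 2 | 0 | aaacatat\$ |
| 1 | 3 | 2 | aacatat\$ |
| 2 | 0 | 1 | acaaacatat\$ |
| 3 | 4 | 3 | acatat\$ |
| 4 | 6 | 1 | atat\$ |
| 5 | 8 | 2 | at\$ |
| 6 | 1 | 0 | caaacatat\$ |
| 7 | 5 | 2 | catat\$ |
| 8 | 7 | 0 | tat\$ |
| 9 | 9 | 1 | t\$ |
| 10 | 10 | 0 | \$ |

lcp-intervals, written as lcp value-[left..right]: 0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5]

Fig1 The enhanced suffix array for S=acaaacatat\$ and its lcp-interval tree. Adapted from [Abo04]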
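The two basic tables of Fig1 can be recomputed with a few lines. This is a sketch under two assumptions: the sentinel `$` is unique and is ranked above every other character (which matches the suffix order in the figure), and lcptab[0] is set to 0. The name `build_tables` is hypothetical, and the construction is the naive quadratic one.

```python
def build_tables(s, sentinel="$"):
    """Return (suftab, lcptab) for s, which must end with a unique sentinel."""

    def key(i):
        # Rank the sentinel above every ordinary character.
        return [0x110000 if ch == sentinel else ord(ch) for ch in s[i:]]

    suftab = sorted(range(len(s)), key=key)

    # lcptab[k] = length of the longest common prefix of the suffixes
    # at ranks k-1 and k; lcptab[0] is 0 by convention.
    lcptab = [0] * len(s)
    for k in range(1, len(s)):
        a, b = suftab[k - 1], suftab[k]
        l = 0
        while max(a, b) + l < len(s) and s[a + l] == s[b + l]:
            l += 1
        lcptab[k] = l
    return suftab, lcptab


suftab, lcptab = build_tables("acaaacatat$")
print(suftab)
print(lcptab)
```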

Enhanced Suffix Array (2)
lcp-interval tree, written as lcp value-[left..right]:
- 0-[0..10]
  - 1-[0..5]
    - 2-[0..1]
    - 3-[2..3]
    - 2-[4..5]
  - 2-[6..7]
  - 1-[8..9]

Fig2 The lcp-interval tree vs. the suffix tree for S=acaaacatat\$
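The lcp-intervals can be enumerated from lcptab alone with a single left-to-right pass and a stack, which is the idea behind simulating a bottom-up suffix tree traversal. The sketch below is a simplified illustration rather than the exact procedure of [Abo04]; the lcptab values are those of the example string S=acaaacatat$, and the name `lcp_intervals` is chosen here.

```python
def lcp_intervals(lcptab):
    """Return all lcp-intervals as (lcp, left, right), children before parents."""
    n = len(lcptab)
    result = []
    stack = [(0, 0)]  # (lcp value, left boundary) of the intervals still open
    for k in range(1, n):
        left = k - 1
        # A smaller lcp value closes every open interval with a larger one.
        while lcptab[k] < stack[-1][0]:
            lcp, left = stack.pop()
            result.append((lcp, left, k - 1))
        # A larger lcp value opens a new interval starting at 'left'.
        if lcptab[k] > stack[-1][0]:
            stack.append((lcptab[k], left))
    # Whatever is still open ends at the last index.
    while stack:
        lcp, left = stack.pop()
        result.append((lcp, left, n - 1))
    return result


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # for S = acaaacatat$
intervals = lcp_intervals(lcptab)
for lcp, left, right in intervals:
    print(f"{lcp}-[{left}..{right}]")
```

Each interval is reported only after all intervals nested inside it, which is the order a bottom-up traversal needs.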

Enhanced Suffix Array (3): the more tables, the more like a suffix tree?
- childtab: the up, down and next fields record the parent-child and sibling relationships in the lcp-interval tree (0-[0..10]; 1-[0..5], 2-[6..7], 1-[8..9]; 2-[0..1], 3-[2..3], 2-[4..5]).
- The lcp-interval tree is like a suffix tree. It is only virtual, but it can simulate suffix tree traversal efficiently.

Fig3 childtab records the links in the lcp-interval tree

Enhanced suffix array replaces suffix tree
- Every algorithm that uses a suffix tree can be systematically replaced by one that uses an (enhanced) suffix array, with the same time complexity.
- Bottom-up traversal of the suffix tree -> suffix array with lcptab and the lcp-interval tree
- Top-down traversal of the suffix tree -> suffix array with childtab
- Example: answering a decision query

Algorithm: Answering decision queries

    c := 0
    queryFound := True
    (i, j) := getInterval(0, n, P[c])
    while (i, j) ≠ ⊥ and c < m and queryFound = True
        if i ≠ j then
            l := getlcp(i, j)
            min := min{l, m}
            queryFound := S[suftab[i]+c .. suftab[i]+min−1] = P[c .. min−1]
            c := min
            if c < m then (i, j) := getInterval(i, j, P[c])
        else
            queryFound := S[suftab[i]+c .. suftab[i]+m−1] = P[c .. m−1]
            c := m
    if (i, j) ≠ ⊥ and queryFound then
        report [i, j] as the interval of occurrences of P
    else
        print(P is not found in S)
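A runnable sketch of the decision query follows. It keeps the structure of the algorithm (getInterval, getlcp, comparing P against S in chunks of the interval's lcp value) but makes one simplifying assumption: `get_interval` finds child intervals by scanning lcptab instead of using childtab, so it is slower than the constant-time-per-child version the enhanced suffix array is designed for. All function names are illustrative, `$` is assumed to be a unique sentinel ranked above all other characters, and P is assumed to be non-empty.

```python
def build_tables(s, sentinel="$"):
    # Naive suftab/lcptab construction; the sentinel sorts last.
    def key(i):
        return [0x110000 if ch == sentinel else ord(ch) for ch in s[i:]]
    suftab = sorted(range(len(s)), key=key)
    lcptab = [0] * len(s)
    for k in range(1, len(s)):
        a, b = suftab[k - 1], suftab[k]
        l = 0
        while max(a, b) + l < len(s) and s[a + l] == s[b + l]:
            l += 1
        lcptab[k] = l
    return suftab, lcptab


def getlcp(lcptab, i, j):
    # lcp value of the interval [i..j], i < j.
    return min(lcptab[i + 1:j + 1])


def get_interval(s, suftab, lcptab, i, j, ch):
    """Child interval of [i..j] whose suffixes continue with ch, or None."""
    if i == j:
        return None
    depth = getlcp(lcptab, i, j)
    # Child intervals are separated at the positions where lcptab == depth.
    cuts = [i] + [k for k in range(i + 1, j + 1) if lcptab[k] == depth] + [j + 1]
    for left, nxt in zip(cuts, cuts[1:]):
        pos = suftab[left] + depth
        if pos < len(s) and s[pos] == ch:
            return (left, nxt - 1)
    return None


def find_interval(s, suftab, lcptab, p):
    """Return the interval (i, j) of suffixes that start with p, or None."""
    m, c = len(p), 0
    iv = get_interval(s, suftab, lcptab, 0, len(s) - 1, p[0])
    while iv is not None and c < m:
        i, j = iv
        end = m if i == j else min(getlcp(lcptab, i, j), m)
        start = suftab[i]
        if s[start + c:start + end] != p[c:end]:
            return None
        c = end
        if c < m:
            iv = get_interval(s, suftab, lcptab, i, j, p[c])
    return iv


s = "acaaacatat$"
suftab, lcptab = build_tables(s)
print(find_interval(s, suftab, lcptab, "caaa"))
print(find_interval(s, suftab, lcptab, "cb"))
```

The returned interval [i..j] also gives the number of occurrences, j − i + 1, and their positions suftab[i..j].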

Example on the sorted suffixes S_suftab[i] of S=acaaacatat\$: the queries P=cb (not a substring of S) and P=caaa (a substring of S), matched along the longest common prefix.

Additional tables eat too much space?
- There are tricks to reduce the space requirements.
- If the string length n=|S| < 2^32, each integer index needs 4 bytes.
- suftab needs 4n bytes; does lcptab also need 4n bytes? No!
- Usually only a few entries in lcptab are larger than 255, so store each lcptab entry in 1 byte and allocate another table for the long lcp values.
- Space is saved and time efficiency is preserved in practice, though the worst-case time complexity may be affected.
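The one-byte trick can be sketched as follows. The details are assumptions made for illustration: the value 255 serves as an escape marker, so every entry of 255 or more goes into a separate overflow table sorted by index, and a lookup there uses binary search. That extra search is the reason the worst-case time can be affected.

```python
from bisect import bisect_left


def compress_lcptab(lcptab):
    """Split lcptab into a 1-byte-per-entry table and an overflow table."""
    small = bytearray(min(v, 255) for v in lcptab)
    # (index, value) pairs, already sorted by index.
    overflow = [(i, v) for i, v in enumerate(lcptab) if v >= 255]
    return small, overflow


def lookup(small, overflow, i):
    """Return the original lcptab[i]."""
    if small[i] < 255:
        return small[i]  # common case: one byte is enough
    pos = bisect_left(overflow, (i, 0))  # rare case: binary search
    return overflow[pos][1]


lcptab = [0, 3, 254, 255, 1000, 7, 300]
small, overflow = compress_lcptab(lcptab)
print(len(small), overflow)
print([lookup(small, overflow, i) for i in range(len(lcptab))])
```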

Conclusions
- Suffix array: there is always a tension between space and speed, and research tries to ease that tension.
- The (enhanced) suffix array can replace the suffix tree.
- The suffix array is practical: faster and easier to implement.

References
- [Abo04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms, Volume 2, Issue 1 (March 2004), p. 54-86
- [Abo02A] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, The Enhanced Suffix Array and Its Applications to Genome Analysis, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, September 17-21, 2002, p
- [Abo02B] Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, Stefan Kurtz, Optimal Exact String Matching Based on Suffix Arrays, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, p. 31-43, September 11-13, 2002
- [Gon92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider, New indices for text: PAT Trees and PAT arrays, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992
- [Kur99] S. Kurtz, Reducing the space requirement of suffix trees, Software: Practice and Experience 29 (13) (1999), p. 1149-1171
- [Man93] Udi Manber, Gene Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, v.22 n.5, p , Oct. 1993
