1
**Suffix Array: Data structures and applications**

Zhifeng Liu School of Computer Science, Univ. of Waterloo Dec. 06, 2004

2
**Outline**

- Introduction
- Suffix array and enhanced suffix array
- An example: Is P a substring of S?
- Conclusions
- References

3
**Introduction: Why suffix array?**

- Suffix tree’s drawbacks:
  - Space consumption: 20n bytes (n = |S|, the string length) [Kur99]
  - Memory locality: loss of efficiency
- Suffix array (PAT array): Manber & Myers [Man93], also Gonnet & Baeza-Yates [Gon92]

4
**Suffix array Definition & an example**

- Informal definition: the suffixes of S in alphabetical order; the same information as a suffix tree, but more compact.
- Example: the suffix array for banana# is:
  - #
  - a#
  - ana#
  - anana#
  - banana#
  - na#
  - nana#

From Prof. Brown’s Assignment 2 handout.
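The banana# example can be reproduced with a few lines of Python. This is only a minimal sketch: it sorts the suffixes naively (far slower than the construction algorithms discussed in these slides), and the function name `build_suffix_array` is an assumption made for illustration. It relies on `#` sorting before the letters, which holds in ASCII.

```python
def build_suffix_array(s: str) -> list[int]:
    """Return the start positions of all suffixes of s, in sorted suffix order.

    Naive approach: sort positions by the suffix that starts there.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


text = "banana#"
sa = build_suffix_array(text)
for pos in sa:
    # Each row: start position in the text, then the suffix itself.
    print(pos, text[pos:])
```

The array stores only the start positions; the suffixes themselves are read from the text when needed, which is where the space saving comes from.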

5
**Suffix array isn’t perfect either**

- Less space: 4n bytes, but direct construction takes O(n log n) time.
- Linear-time construction is possible via a suffix tree, but that sacrifices space.
- Binary search for a substring P takes O(m log n) time (m = |P|).
- Hence the enhanced suffix array!
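The O(m log n) bound comes from a binary search over the sorted suffixes in which each step compares at most m characters. A minimal sketch, assuming a naively built suffix array; the names `build_suffix_array` and `contains` are illustrative and not taken from the slides:

```python
def build_suffix_array(s: str) -> list[int]:
    """Naive construction: sort suffix start positions by suffix."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def contains(s: str, sa: list[int], p: str) -> bool:
    """Binary search for p among the sorted suffixes of s.

    Each step compares p with the first len(p) characters of one suffix,
    so there are O(log n) steps of O(m) work each.
    """
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1   # p can only occur further right
        else:
            hi = mid       # p occurs here or further left
    return lo < len(sa) and s[sa[lo]:sa[lo] + m] == p


text = "banana#"
sa = build_suffix_array(text)
print(contains(text, sa, "nan"))
print(contains(text, sa, "nab"))
```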

6
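**Enhanced Suffix Array = suffix array + additional tables**

| i | suftab | lcptab | S_suftab[i] |
|---|---|---|---|
| 0 | 2 | 0 | aaacatat$ |
| 1 | 3 | 2 | aacatat$ |
| 2 | 0 | 1 | acaaacatat$ |
| 3 | 4 | 3 | acatat$ |
| 4 | 6 | 1 | atat$ |
| 5 | 8 | 2 | at$ |
| 6 | 1 | 0 | caaacatat$ |
| 7 | 5 | 2 | catat$ |
| 8 | 7 | 0 | tat$ |
| 9 | 9 | 1 | t$ |
| 10 | 10 | 0 | $ |

lcp-interval tree (each node written as lcp value-[left..right]):

- 0-[0..10]
  - 1-[0..5]
    - 2-[0..1]
    - 3-[2..3]
    - 2-[4..5]
  - 2-[6..7]
  - 1-[8..9]

Fig. 1: The enhanced suffix array for S = acaaacatat$ and its lcp-interval tree. Adapted from [Abo04].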
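The tables of Fig. 1 can be recomputed with a short sketch. Two assumptions are made here: `$` is treated as larger than every other character (this matches the order in the figure, where atat$ precedes at$), and the construction is naive rather than one of the efficient algorithms. The function names are illustrative only. The lcp-intervals are enumerated with a single left-to-right pass over lcptab using a stack.

```python
def sort_key(suffix: str) -> str:
    # Assumption: '$' sorts after every other character, as in Fig. 1.
    return suffix.replace("$", "\x7f")


def build_suftab(s: str) -> list[int]:
    """Naive suffix array: sort start positions by suffix."""
    return sorted(range(len(s)), key=lambda i: sort_key(s[i:]))


def build_lcptab(s: str, suftab: list[int]) -> list[int]:
    """lcptab[k] = length of the longest common prefix of rows k-1 and k."""
    lcptab = [0] * len(suftab)
    for k in range(1, len(suftab)):
        a, b = suftab[k - 1], suftab[k]
        length = 0
        while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
            length += 1
        lcptab[k] = length
    return lcptab


def lcp_intervals(lcptab: list[int]) -> list[tuple[int, int, int]]:
    """Enumerate all non-singleton lcp-intervals as (lcp value, left, right)."""
    n = len(lcptab)
    result = []
    stack = [(0, 0)]  # (lcp value, left boundary); starts with the root
    for k in range(1, n + 1):
        cur = lcptab[k] if k < n else -1  # -1 sentinel closes everything at the end
        left = k - 1
        while stack and cur < stack[-1][0]:
            value, lb = stack.pop()
            result.append((value, lb, k - 1))
            left = lb
        if not stack or cur > stack[-1][0]:
            stack.append((cur, left))
    return result


S = "acaaacatat$"
suftab = build_suftab(S)
lcptab = build_lcptab(S, suftab)
print(suftab)
print(lcptab)
for value, left, right in lcp_intervals(lcptab):
    print(f"{value}-[{left}..{right}]")
```

Intervals are reported when they close, so children appear before their parents; that is the bottom-up order in which the lcp-interval tree is traversed.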

7
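**Enhanced Suffix Array (2)**

Fig. 2: The lcp-interval tree vs. the suffix tree for S = acaaacatat$. (Figure not reproduced. The lcp-intervals shown are 0-[0..10], 1-[0..5], 2-[6..7], 1-[8..9], 2-[0..1], 3-[2..3], 2-[4..5]; the suffix tree has top-level edges labelled a, ca, t and $, and leaves numbered 1 to 10.)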

8
**Enhanced Suffix Array (3): the more tables, the more like a suffix tree?**

- ChildTab: up, down and next fields record the parent-child and sibling relationships.
- The lcp-interval tree is like a suffix tree. It is virtual (never built explicitly), yet it can simulate suffix tree traversal efficiently.

Fig. 3: ChildTab records the links in the lcp-interval tree (intervals 0-[0..10], 1-[0..5], 2-[6..7], 1-[8..9], 2-[0..1], 3-[2..3], 2-[4..5]).
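To make the parent-child relationship concrete, the sketch below lists the child intervals of an lcp-interval by scanning lcptab directly. This is not the childtab itself: childtab stores these links so they can be followed without scanning, whereas the hypothetical helper `child_intervals` recomputes them on demand. The lcptab values are those of S = acaaacatat$.

```python
def child_intervals(lcptab: list[int], i: int, j: int) -> tuple[int, list[tuple[int, int]]]:
    """Return (lcp value, children) for the lcp-interval [i..j], where i < j.

    The interval is split at every position k in (i, j] where lcptab[k]
    equals the interval's lcp value; the pieces are the child intervals
    (singleton pieces correspond to leaves).
    """
    value = min(lcptab[i + 1:j + 1])
    children = []
    start = i
    for k in range(i + 1, j + 1):
        if lcptab[k] == value:
            children.append((start, k - 1))
            start = k
    children.append((start, j))
    return value, children


# lcptab for S = acaaacatat$ (with '$' sorted last)
lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
print(child_intervals(lcptab, 0, 10))
print(child_intervals(lcptab, 0, 5))
```

The split positions are exactly the sibling links, and the first split position inside an interval is what lets a traversal step down from a parent to its first child.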

9
**Enhanced suffix array replaces suffix tree**

Every algorithm that uses a suffix tree can be systematically replaced by one that uses an (enhanced) suffix array, with the same time complexity:

- Bottom-up traversal of the suffix tree -> suffix array with lcptab (the lcp-interval tree)
- Top-down traversal of the suffix tree -> suffix array with childtab
- Example application: answering decision queries

10
**Answer Decision Queries**

Algorithm: Answering decision queries

    c := 0
    queryFound := true
    (i, j) := getInterval(0, n, P[c])
    while (i, j) <> ⊥ and c < m and queryFound = true
        if i <> j then
            l := getlcp(i, j)
            min := min{l, m}
            queryFound := S[suftab[i] + c .. suftab[i] + min − 1] = P[c .. min − 1]
            c := min
            (i, j) := getInterval(i, j, P[c])
        else
            queryFound := S[suftab[i] + c .. suftab[i] + m − 1] = P[c .. m − 1]
    if queryFound then
        Report [i, j] as an occurrence of P
    else
        print(P is not found in S)
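A runnable sketch of the decision query follows, with several stated assumptions. `get_interval` is implemented here as a linear scan over the rows of the current interval instead of a childtab lookup, so the sketch shows the control flow but not the running time of the original algorithm. It also adds two guards that the pseudocode leaves implicit: the loop stops after a successful singleton comparison, and P[c] is read only while c < m. `$` is treated as the largest character, and all function names are illustrative.

```python
def build_tables(s: str) -> tuple[list[int], list[int]]:
    """Naively build suftab and lcptab; '$' is assumed to sort last."""
    suftab = sorted(range(len(s)), key=lambda i: s[i:].replace("$", "\x7f"))
    lcptab = [0] * len(s)
    for k in range(1, len(s)):
        a, b = suftab[k - 1], suftab[k]
        while a + lcptab[k] < len(s) and b + lcptab[k] < len(s) and s[a + lcptab[k]] == s[b + lcptab[k]]:
            lcptab[k] += 1
    return suftab, lcptab


def get_interval(s, suftab, i, j, depth, ch):
    """Rows of [i..j] whose suffix has character ch at offset depth, or None.

    Linear scan standing in for the childtab lookup (assumption).
    """
    rows = [k for k in range(i, j + 1)
            if suftab[k] + depth < len(s) and s[suftab[k] + depth] == ch]
    return (rows[0], rows[-1]) if rows else None


def find_pattern(s, suftab, lcptab, p):
    """Return the interval (i, j) of suffixes starting with p, or None."""
    n, m = len(s), len(p)
    if m == 0:
        return (0, n - 1)
    c = 0
    interval = get_interval(s, suftab, 0, n - 1, 0, p[0])
    while interval is not None and c < m:
        i, j = interval
        if i != j:
            lcp = min(lcptab[i + 1:j + 1])      # getlcp(i, j)
            upto = min(lcp, m)
            if s[suftab[i] + c:suftab[i] + upto] != p[c:upto]:
                return None
            c = upto
            if c < m:                           # guard: only read P[c] if it exists
                interval = get_interval(s, suftab, i, j, c, p[c])
        else:
            if s[suftab[i] + c:suftab[i] + m] != p[c:]:
                return None
            c = m                               # guard: whole pattern matched
    return interval


S = "acaaacatat$"
suftab, lcptab = build_tables(S)
for pattern in ("caaa", "cb", "at"):
    hit = find_pattern(S, suftab, lcptab, pattern)
    if hit is None:
        print(pattern, "is not found in S")
    else:
        i, j = hit
        print(pattern, "occurs at positions", sorted(suftab[i:j + 1]))
```

Returning the whole interval rather than a boolean means the same routine also yields every occurrence of P, since all matching suffixes are adjacent in suftab.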

11
**Answer Decision Queries (cont’d)**

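Example on S = acaaacatat$, with the suffixes S_suftab[i] in suffix-array order: aaacatat$, aacatat$, acaaacatat$, acatat$, atat$, at$, caaacatat$, catat$, tat$, t$, $.

Queries traced: P = cb and P = caaa. (Figure annotation: longest common string.)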

12
**Additional tables eat too much space?**

- There are tricks to reduce the space requirements.
- If the string length n = |S| < 2^32, each integer index needs 4 bytes.
- suftab needs 4n bytes; does lcptab also need 4n? No!
- Usually only a few entries in lcptab are greater than 255, so store each lcptab entry in 1 byte and allocate a separate table for the long lcp values.
- Space is saved and practical time efficiency is preserved, though the worst-case time complexity may be affected.
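The one-byte trick can be sketched as follows. The layout is an assumption made for illustration: the byte value 255 marks an entry whose real value lives in a separate overflow table, and that table is a plain dictionary here. A sorted array searched by index would be a more compact choice, and that extra lookup is where the worst-case time can suffer.

```python
def compress_lcptab(lcptab: list[int]) -> tuple[bytes, dict[int, int]]:
    """Store each lcp value in one byte; values >= 255 go to an overflow table."""
    small = bytearray(len(lcptab))
    overflow = {}
    for k, value in enumerate(lcptab):
        if value >= 255:
            small[k] = 255          # marker: look the real value up in overflow
            overflow[k] = value
        else:
            small[k] = value
    return bytes(small), overflow


def get_lcp_value(small: bytes, overflow: dict[int, int], k: int) -> int:
    """Read entry k, consulting the overflow table only for marked entries."""
    value = small[k]
    return overflow[k] if value == 255 else value


# Made-up lcp values, chosen only to exercise both the small and the long case.
sample = [0, 2, 300, 255, 254, 1000]
small, overflow = compress_lcptab(sample)
print(len(small), "bytes in the main table,", len(overflow), "overflow entries")
print([get_lcp_value(small, overflow, k) for k in range(len(sample))])
```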

13
**Conclusions**

- Suffix arrays: there is always a tension between space and speed, and research tries to relieve that tension.
- Suffix arrays can replace suffix trees.
- Suffix arrays are practical: faster and easier to implement.

14
**References**

- [Abo04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms, Volume 2, Issue 1 (March 2004), p. 54-86
- [Abo02A] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, The Enhanced Suffix Array and Its Applications to Genome Analysis, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, September 17-21, 2002, p.
- [Abo02B] Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, Stefan Kurtz, Optimal Exact String Matching Based on Suffix Arrays, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, p. 31-43, September 11-13, 2002
- [Gon92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider, New indices for text: PAT Trees and PAT arrays, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992
- [Kur99] S. Kurtz, Reducing the space requirement of suffix trees, Software: Practice and Experience 29 (13) (1999) 1149-1171
- [Man93] Udi Manber, Gene Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, v.22 n.5, p., Oct. 1993
