1
**Suffix Array: Data structures and applications**

Zhifeng Liu School of Computer Science, Univ. of Waterloo Dec. 06, 2004

2
**Outline**

- Introduction
- Suffix array and enhanced suffix array
- An example: is P a substring of S?
- Conclusions
- References

3
**Introduction: Why suffix array?**

Suffix tree's drawbacks:

- Space consumption: about 20n bytes (n = |S|, the string length) [Kur99]
- Memory locality: poor locality of reference costs efficiency

Suffix array (PAT array): Manber & Myers [Man93], also Gonnet & Baeza-Yates [Gon92]

4
**Suffix array: Definition & an example**

Informal definition: the same information as a suffix tree, but more compact. It lists the suffixes in alphabetical order.

Example: the suffix array for banana# is:

- #
- a#
- ana#
- anana#
- banana#
- na#
- nana#

From Prof. Brown's Assignment 2 handout
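The definition can be sketched in a few lines of Python. This is a naive construction that simply sorts the suffixes, not a construction algorithm from the slides. The name `naive_suffix_array` is illustrative, and the sketch relies on `#` comparing lower than lowercase letters in ASCII, which matches the ordering in the example.

```python
def naive_suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order."""
    # Sorting whole suffixes is simple but slow; it is only meant for small inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])


s = "banana#"
sa = naive_suffix_array(s)
for start in sa:
    print(start, s[start:])
```

The array itself stores only the start positions; the suffix text is recovered by slicing `s`.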

5
**Suffix array isn't perfect either**

- Less space: 4n bytes, but direct construction takes O(n log n) time
- Linear-time construction is possible via the suffix tree, but that sacrifices space
- Binary search for a substring P takes O(m log n) time (m = |P|)

Hence the enhanced suffix array!
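The O(m log n) bound comes from doing O(log n) binary-search steps, each comparing at most m characters. A minimal sketch follows, assuming a naively built suffix array; the function names are illustrative and not from the slides.

```python
def naive_suffix_array(s):
    """Suffix start positions in lexicographic order (naive, for small inputs)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def is_substring(s, sa, p):
    """Binary search for p among the sorted suffixes of s."""
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first m characters of the middle suffix with p.
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    # lo is now the first suffix whose m-character prefix is >= p.
    return lo < len(sa) and s[sa[lo]:sa[lo] + m] == p


s = "banana#"
sa = naive_suffix_array(s)
print(is_substring(s, sa, "nan"), is_substring(s, sa, "nab"))
```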

6
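**Enhanced Suffix Array = suffix array + additional tables**

| i | suftab | lcptab | S_suftab[i] |
|---|--------|--------|-------------|
| 0 | 2 | 0 | aaacatat$ |
| 1 | 3 | 2 | aacatat$ |
| 2 | 0 | 1 | acaaacatat$ |
| 3 | 4 | 3 | acatat$ |
| 4 | 6 | 1 | atat$ |
| 5 | 8 | 2 | at$ |
| 6 | 1 | 0 | caaacatat$ |
| 7 | 5 | 2 | catat$ |
| 8 | 7 | 0 | tat$ |
| 9 | 9 | 1 | t$ |
| 10 | 10 | 0 | $ |

The lcp-interval tree (each node is written lcp-[left,right]):

- 0-[0,10]
  - 1-[0,5]
    - 2-[0,1]
    - 3-[2,3]
    - 2-[4,5]
  - 2-[6,7]
  - 1-[8,9]

Fig1 The enhanced suffix array for S=acaaacatat$ and its lcp-interval tree. Adapted from [Abo04]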
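To make the two tables concrete, here is a small sketch that computes suftab and lcptab for S=acaaacatat$. It assumes that `$` sorts after every letter (the `$` suffix is last in the figure), and it uses a naive quadratic construction rather than the algorithms from [Abo04]. The function name is illustrative.

```python
def enhanced_suffix_array(s, sentinel="$"):
    """Return (suftab, lcptab) for s, treating the sentinel as the largest character."""
    # Assumption: map the sentinel to "~" (ASCII 126) for comparison only,
    # so that it sorts after all lowercase letters.
    suftab = sorted(range(len(s)), key=lambda i: s[i:].replace(sentinel, "~"))
    # lcptab[r] = length of the longest common prefix of the suffixes
    # at ranks r-1 and r; lcptab[0] is 0 by convention.
    lcptab = [0] * len(s)
    for r in range(1, len(s)):
        a, b = s[suftab[r - 1]:], s[suftab[r]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcptab[r] = k
    return suftab, lcptab


suftab, lcptab = enhanced_suffix_array("acaaacatat$")
print(suftab)
print(lcptab)
```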

7
**Enhanced Suffix Array(2)**

Fig2 The lcp-interval tree vs the suffix tree for S=acaaacatat$ (figure not reproduced). The lcp-interval tree has root 0-[0,10] with children 1-[0,5], 2-[6,7] and 1-[8,9]; node 1-[0,5] has children 2-[0,1], 3-[2,3] and 2-[4,5].

8
**Enhanced Suffix Array(3) - the more tables, the more like a suffix tree?**

- ChildTab: up, down and next fields record the parent-child and sibling relationships.
- The lcp-interval tree is like a suffix tree. However, it is virtual (never built explicitly), yet it can simulate suffix tree traversal efficiently.

Fig3 ChildTab records the linked relationships in the lcp-interval tree: root 0-[0,10] with children 1-[0,5], 2-[6,7] and 1-[8,9]; node 1-[0,5] with children 2-[0,1], 3-[2,3] and 2-[4,5] (figure not reproduced).
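The parent-child relationships can be illustrated without building the child table itself. The sketch below finds the child intervals of an lcp-interval by scanning lcptab for the positions that hold the interval's minimum value. It is a linear-scan stand-in, not the constant-time childtab lookup, and `child_intervals` is a hypothetical helper name. The lcptab values are those of S=acaaacatat$ with `$` ordered after every letter.

```python
# lcptab for S = acaaacatat$ (assumption: $ sorts after every letter).
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def child_intervals(lcptab, i, j):
    """Children of the lcp-interval [i, j] with i < j, including singletons."""
    lcp = min(lcptab[i + 1:j + 1])  # lcp value of the interval itself
    # Every position holding that minimum starts a new child interval.
    cuts = [k for k in range(i + 1, j + 1) if lcptab[k] == lcp]
    starts = [i] + cuts
    ends = [k - 1 for k in cuts] + [j]
    return list(zip(starts, ends))


print(child_intervals(LCPTAB, 0, 10))
print(child_intervals(LCPTAB, 0, 5))
```

Singleton intervals such as (10, 10) correspond to leaves, that is, to single suffixes.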

9
**Enhanced suffix array replaces suffix tree**

Every algorithm that uses a suffix tree can be systematically replaced by one that uses an (enhanced) suffix array, with the same time complexity:

- Bottom-up traversal of the suffix tree -> suffix array with lcptab and the lcp-interval tree
- Top-down traversal of the suffix tree -> suffix array with childtab
- Example application: answering decision queries
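As an illustration of the bottom-up case, the following sketch enumerates all lcp-intervals in one left-to-right pass over lcptab with a stack, so each interval is reported only after all of its children. It is a simplified version written for this transcript, with hypothetical names, and the lcptab values assume S=acaaacatat$ with `$` ordered after every letter.

```python
# lcptab for S = acaaacatat$ (assumption: $ sorts after every letter).
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def lcp_intervals(lcptab):
    """Return (lcp, left, right) for every lcp-interval, children before parents."""
    n = len(lcptab) - 1
    stack = [(0, 0)]  # open intervals as (lcp value, left boundary)
    found = []
    for i in range(1, n + 1):
        left = i - 1
        # A drop in the lcp value closes every open interval with a larger value.
        while lcptab[i] < stack[-1][0]:
            lcp, left = stack.pop()
            found.append((lcp, left, i - 1))
        # A rise opens a new interval that starts at the last closed boundary.
        if lcptab[i] > stack[-1][0]:
            stack.append((lcptab[i], left))
    # Whatever is still open (at least the root) extends to the last suffix.
    while stack:
        lcp, left = stack.pop()
        found.append((lcp, left, n))
    return found


for lcp, left, right in lcp_intervals(LCPTAB):
    print(f"{lcp}-[{left},{right}]")
```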

10
**Answer Decision Queries**

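Algorithm: answering decision queries (is P, with m = |P|, a substring of S?)

    c := 0
    queryFound := true
    (i, j) := getInterval(0, n, P[c])
    while (i, j) <> ⊥ and c < m and queryFound = true
        if i <> j then
            l := getlcp(i, j)
            min := min{l, m}
            queryFound := S[suftab[i]+c .. suftab[i]+min−1] = P[c .. min−1]
            c := min
            if c < m then (i, j) := getInterval(i, j, P[c])
        else
            queryFound := S[suftab[i]+c .. suftab[i]+m−1] = P[c .. m−1]
            c := m
    if queryFound and (i, j) <> ⊥ then report [i, j] as an occurrence of P
    else print(P is not found in S)

Here getInterval(i, j, a) returns the child interval of [i, j] whose suffixes continue with character a (or ⊥ if there is none), and getlcp(i, j) returns the lcp value of the interval [i, j].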
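A runnable sketch of the same search is given below, under several assumptions: the tables are built naively, `$` sorts after every letter, P is non-empty, and `get_interval` and `get_lcp` are implemented by plain scans of the tables rather than with childtab, so the running time is not the one claimed for the enhanced suffix array. All function names are illustrative.

```python
def build_tables(s, sentinel="$"):
    """Naive suftab/lcptab construction; the sentinel is treated as largest."""
    suftab = sorted(range(len(s)), key=lambda i: s[i:].replace(sentinel, "~"))
    lcptab = [0] * len(s)
    for r in range(1, len(s)):
        a, b = s[suftab[r - 1]:], s[suftab[r]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcptab[r] = k
    return suftab, lcptab


def get_lcp(lcptab, i, j):
    """lcp value of the interval [i, j], i < j (scan instead of childtab)."""
    return min(lcptab[i + 1:j + 1])


def get_interval(s, suftab, lcptab, i, j, ch):
    """Child interval of [i, j] whose suffixes continue with ch, or None."""
    depth = get_lcp(lcptab, i, j)
    hits = [k for k in range(i, j + 1)
            if suftab[k] + depth < len(s) and s[suftab[k] + depth] == ch]
    return (hits[0], hits[-1]) if hits else None


def find(s, p):
    """Return the suftab interval of all suffixes starting with p, or None."""
    suftab, lcptab = build_tables(s)
    m, c = len(p), 0
    interval = get_interval(s, suftab, lcptab, 0, len(s) - 1, p[0])
    while interval is not None and c < m:
        i, j = interval
        start = suftab[i]
        if i != j:
            upto = min(get_lcp(lcptab, i, j), m)
            if s[start + c:start + upto] != p[c:upto]:
                return None  # mismatch inside the common prefix
            c = upto
            if c < m:
                interval = get_interval(s, suftab, lcptab, i, j, p[c])
        else:
            if s[start + c:start + m] != p[c:m]:
                return None  # mismatch against the single remaining suffix
            c = m
    return interval


print(find("acaaacatat$", "caaa"))
print(find("acaaacatat$", "cb"))
```

Returning the interval rather than a boolean also gives the occurrences: every suftab entry inside the interval is a start position of P in S.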

11
**Answer Decision Queries (cont’d)**

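Sorted suffixes S_suftab[i], i = 0..10: aaacatat$, aacatat$, acaaacatat$, acatat$, atat$, at$, caaacatat$, catat$, tat$, t$, $

Example queries:

- P=cb: not found. The suffixes starting with c form the interval [6,7] (caaacatat$, catat$), whose longest common string is ca, and ca does not match cb.
- P=caaa: found. The longest common string ca matches, and the search continues into the single suffix caaacatat$.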

12
**Additional tables eat too much space?**

There are tricks to reduce the space requirements.

- If the string length n = |S| < 2^32, each integer index needs 4 bytes.
- suftab needs 4n bytes; does lcptab also need 4n? No! Usually only a few entries in lcptab exceed 255.
- So store each entry of lcptab in 1 byte and allocate another table for the long lcp values.
- Space is saved and practical time efficiency is preserved, though the worst-case time complexity may be affected.
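The one-byte trick can be sketched as follows. This is an assumed layout, not necessarily the one used by the cited authors: values below 255 are stored directly, 255 acts as a marker, and the marked entries live in a separate sorted table that is searched with `bisect`. The class name `CompactLcp` is hypothetical.

```python
from bisect import bisect_left


class CompactLcp:
    """lcptab stored in one byte per entry plus an overflow table."""

    def __init__(self, lcptab):
        # Values >= 255 are clamped to the marker 255 in the byte table.
        self.small = bytes(min(v, 255) for v in lcptab)
        # Overflow table: indices (ascending) and their true lcp values.
        self.long_idx = [i for i, v in enumerate(lcptab) if v >= 255]
        self.long_val = [v for v in lcptab if v >= 255]

    def __getitem__(self, i):
        v = self.small[i]
        if v < 255:
            return v  # the common case: one byte read
        # Rare case: binary search in the overflow table.
        return self.long_val[bisect_left(self.long_idx, i)]


values = [0, 2, 300, 255, 254, 1000]
compact = CompactLcp(values)
print([compact[i] for i in range(len(values))])
```

The lookup for a long value costs a binary search, which is one way the worst-case time complexity can be affected.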

13
**Conclusions**

- Suffix array: there is always a tension between space and speed, and research tries to relieve that tension.
- The suffix array can replace the suffix tree.
- The suffix array is practical: faster and easier to implement.

14
**References**

[Abo04] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms, Volume 2, Issue 1 (March 2004), p.54-86

[Abo02A] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, Enno Ohlebusch, The Enhanced Suffix Array and Its Applications to Genome Analysis, Proceedings of the Second International Workshop on Algorithms in Bioinformatics, September 17-21, 2002

[Abo02B] Mohamed Ibrahim Abouelhoda, Enno Ohlebusch, Stefan Kurtz, Optimal Exact String Matching Based on Suffix Arrays, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, p.31-43, September 11-13, 2002

[Gon92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider, New indices for text: PAT Trees and PAT arrays, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992

[Kur99] S. Kurtz, Reducing the space requirement of suffix trees, Software: Practice and Experience 29 (13) (1999), p.1149-1171

[Man93] Udi Manber, Gene Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, v.22 n.5, Oct. 1993
