A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo.


1 A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo

2 The Problem. Initial problem: text searching, i.e. finding occurrences of a pattern string in a large (static) document. Solution: text indexing, which trades space for time. New problem: succinct text indexes, which reduce the space cost.

3 Pattern Searching. Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T. Three types of queries: (1) existential queries: does P occur in T? (2) cardinality queries: how many times does P occur in T? (3) listing queries: where does P occur in T?
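As a baseline for the three query types, here is a minimal sketch that answers them by scanning the text directly, with no index at all. The function name `occurrences` and the choice of the slide 11 text as sample input are assumptions made for illustration; positions are 1-based to match the T[1..n] notation.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of pattern in text (overlaps included)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)            # convert 0-based index to 1-based
        start = text.find(pattern, start + 1)  # resume one character later
    return positions


text = "abaaabbaaabaabb#"
hits = occurrences(text, "aba")

exists = bool(hits)   # existential query: does P occur in T?
count = len(hits)     # cardinality query: how many times does P occur?
where = hits          # listing query: where does P occur?
```

Each query here costs time proportional to the text length, which is the cost a text index is meant to avoid.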

4 Text Indexing. Inverted files: a word index; the text must be stored as well as the index. Suffix trees: an efficient full-text index, but 4n lg n to 6n lg n bits! Suffix arrays: n lg n bits in basic form, but 3n lg n bits with LCP data.

5 Applications. Text databases: electronic encyclopedias, dictionaries, books, etc. Web search engines: Google, Altavista, etc. Bioinformatics: gene databases. More…

6 Related Work. Compressed suffix arrays: Grossi & Vitter 2000; Sadakane 2000; Grossi, Gupta & Vitter 2003. FM-index: Ferragina & Manzini 2000 & 2001.

7 Assumptions & Notation. Alphabet: Σ = {a, b}. Text: T[1..n], with T[n] = #, where a < # < b. Pattern: P[1..m].

8 Permutations and Suffix Arrays. An observation: there are n! permutations of 1..n but only 2^(n-1) suffix arrays, so not all permutations are suffix arrays. An example: 4, 7, 5, 1, 8, 3, 6, 2 is the suffix array of the text abbaaba#, whereas the permutation 4, 7, 1, 5, 8, 2, 3, 6 is not the suffix array of any binary text.
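The example suffix array can be reproduced with a short sketch. It sorts the suffixes naively under the order a < # < b from slide 7; the names `ORDER` and `suffix_array` are assumptions made for illustration, and the naive sort is chosen for clarity rather than efficiency.

```python
# Character order from the slides: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text: str) -> list[int]:
    """Return the 1-based start positions of all suffixes in sorted order."""
    def key(start: int) -> list[int]:
        # Map each character of the suffix to its rank under a < # < b.
        return [ORDER[c] for c in text[start - 1:]]

    return sorted(range(1, len(text) + 1), key=key)


sa = suffix_array("abbaaba#")
print(sa)  # [4, 7, 5, 1, 8, 3, 6, 2]
```

Because # occurs only at the end of the text, no suffix is a prefix of another, so the sorted order is unambiguous.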

9 Two Features of Suffix Arrays: ascending-to-max and non-nesting. Suffix array: 4 7 5 1 8 3 6 2. Another permutation: 4 7 1 5 8 2 3 6.

10 A Categorization Theorem. A permutation is a suffix array iff it is both ascending-to-max and non-nesting. An immediate application: checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words of memory.
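The transcript does not define the two properties, so the sketch below does not implement the O(n) algorithm from the talk. It is a simpler check built on a different observation: under a < # < b, every suffix listed before position n in the permutation must start with a and every suffix listed after it must start with b, so a permutation determines at most one candidate text. Rebuilding that text and re-sorting its suffixes settles the question. All function names here are assumptions made for illustration.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b, as on slide 7


def suffix_array(text: str) -> list[int]:
    """Naive 1-based suffix array under the order a < # < b."""
    return sorted(range(1, len(text) + 1),
                  key=lambda start: [ORDER[c] for c in text[start - 1:]])


def candidate_text(perm: list[int]) -> str:
    """Build the only binary text whose suffix array could be perm."""
    n = len(perm)
    k = perm.index(n)  # the suffix "#" sits after all suffixes starting with a
    chars = [""] * n
    for i, start in enumerate(perm):
        chars[start - 1] = "a" if i < k else "#" if i == k else "b"
    return "".join(chars)


def is_suffix_array(perm: list[int]) -> bool:
    """Check by reconstruction; slower than the O(n) method in the talk."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(1, n + 1)):
        return False  # not a permutation of 1..n
    return suffix_array(candidate_text(perm)) == list(perm)


print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))  # True
print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))  # False
```

Both permutations from slide 8 reconstruct to the same text, abbaaba#, and only the first one matches its sorted suffixes. The same observation accounts for the 2^(n-1) count: each binary text of length n ending in # has its own suffix array.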

11 Application: Space Efficient Suffix Array
Text: abaaabbaaabaabb#
SA: 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14
Ba: 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1
Bb: 1 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0

12 Basic Searching Algorithm: Answering Cardinality Queries. Basic idea: backward search. Start from the end of the pattern P. For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes are prefixed with P[i..m]. Example: SA = 8 3 9 4 12 1 10 5 13 16 7 2 11 15 6 14, P = aba.
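Backward search on the slide 11 example can be sketched as follows. The sketch assumes that Ba[i] = 1 exactly when the character preceding suffix SA[i] is a, and likewise for Bb; the slides do not state this, but the reading reproduces both bit vectors on slide 11. Plain prefix-sum lists stand in for a succinct rank structure, so this illustrates the search logic only, not the n + o(n)-bit space bound. All names are assumptions made for illustration.

```python
from itertools import accumulate

ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def suffix_array(text: str) -> list[int]:
    """Naive 1-based suffix array under the order a < # < b."""
    return sorted(range(1, len(text) + 1),
                  key=lambda start: [ORDER[c] for c in text[start - 1:]])


def bit_vectors(text: str) -> tuple[list[int], list[int]]:
    """Ba[i] / Bb[i] mark whether the character before suffix SA[i] is a / b."""
    prev = [text[j - 2] if j > 1 else None for j in suffix_array(text)]
    return [int(c == "a") for c in prev], [int(c == "b") for c in prev]


def count_occurrences(text: str, pattern: str) -> int:
    """Cardinality query via backward search over the two bit vectors."""
    ba, bb = bit_vectors(text)
    rank_a = [0] + list(accumulate(ba))  # rank_a[i] = number of 1s in Ba[1..i]
    rank_b = [0] + list(accumulate(bb))
    n_a = text.count("a")                # suffixes starting with a fill SA[1..n_a]
    s, e = 1, len(text)                  # interval for the empty suffix of P
    for c in reversed(pattern):
        if c == "a":
            s, e = rank_a[s - 1] + 1, rank_a[e]
        elif c == "b":
            # b-suffixes begin after the a-suffixes and the single "#" suffix
            s, e = n_a + 1 + rank_b[s - 1] + 1, n_a + 1 + rank_b[e]
        else:
            raise ValueError("pattern must be over the alphabet {a, b}")
        if s > e:
            return 0
    return e - s + 1


print(count_occurrences("abaaabbaaabaabb#", "aba"))  # 2
```

For P = aba the interval narrows from [1, 16] to [1, 9] (suffixes starting with a), then to [11, 13] (ba), then to [6, 7] (aba), which gives the two occurrences at text positions 1 and 10.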

13 More Algorithms and Tradeoffs. Answering listing queries. Speeding up the reporting of occurrences of long patterns. Self-indexing. Time-space tradeoff: multi-level structure.

14 Putting it all together. Three index structures:
Index   | Space (bits) | Pattern searching
Index 1 | n + o(n)     | O(m) (existential & cardinality queries only)
Index 2 | 2n + o(n)    | O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg n) otherwise
Index 3 | O(n)         | O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg^λ n) otherwise

15 Conclusion. Summary: a theorem that characterizes a permutation as the suffix array of a binary string; an efficient algorithm for checking whether a permutation is a suffix array; three space-efficient text indexing methods.

16 Conclusions (Continued). Related subsequent work: generalization to larger alphabets. Open problem: an O(n)-bit text index supporting searching in O(m + occ) time.

17 Thank You.

