Strings: Tries, Suffix Trees

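Trie (Prefix tree)
A trie is a tree for storing (usually) strings.
Each edge represents one letter.
A stored string is denoted by marking the node where it ends as an end node.
Example: cat, car, call, dog, dag, do.
All leaves are end nodes, but an end node need not be a leaf ("do" ends at an internal node on the path to "dog").
[Figure: trie for the example words, branching from the root on c and d, with end nodes marked]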
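The trie described above can be sketched in a few lines of C++. This is a minimal illustration, not code from the slides: the names TrieNode, Trie, insert, contains, and has_prefix are all hypothetical, and children are kept in a std::map so that edges stay in alphabetical order.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical names for illustration; not taken from the slides.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> child;  // one outgoing edge per letter
    bool is_end = false;                              // true if a stored string ends here
};

class Trie {
public:
    // Walk down from the root, creating missing edges, then mark the end node.
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            auto& next = node->child[c];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        node->is_end = true;
    }

    // True only if exactly this word was inserted (the walk stops on an end node).
    bool contains(const std::string& word) const {
        const TrieNode* node = walk(word);
        return node != nullptr && node->is_end;
    }

    // True if the walk for this prefix never falls off the trie.
    bool has_prefix(const std::string& prefix) const {
        return walk(prefix) != nullptr;
    }

private:
    // Follow the edges spelled by s; returns nullptr if an edge is missing.
    const TrieNode* walk(const std::string& s) const {
        const TrieNode* node = &root_;
        for (char c : s) {
            auto it = node->child.find(c);
            if (it == node->child.end()) return nullptr;
            node = it->second.get();
        }
        return node;
    }

    TrieNode root_;
};
```

With the example words inserted, contains("do") and contains("dog") are both true, while contains("ca") is false even though has_prefix("ca") is true: "ca" is on the path to other words but is not marked as an end node.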

Tries Compared to hash tables for strings:
No hash function is needed.
No chaining or collision handling is needed.
A trie can maintain alphabetical ordering.
Like hash tables, tries work on other data besides strings, but might not work as well in those cases.
Efficiency versus hash tables depends on how the structures end up being stored in memory, caches, etc.
Generally, hash tables are likely to be more efficient, but tries are better in the worst case.
At a terminating node, a trie can store more information (a link to other data, for instance).

Suffix Trie
A suffix trie is a trie into which you enter all suffixes of a word.
Example: “reverse” has the suffixes reverse, everse, verse, erse, rse, se, e.
Suffix tries allow faster subsequent processing for various tasks.
However, you need to build the trie first, which can take longer overall for some tasks.

Suffix Trie
[Figure: suffix trie for “reverse”, one letter per edge]

Suffix Tree from a Suffix Trie
Suffix tries tend to have long “chains” of nodes, which is inefficient.
Instead of one letter per edge, a suffix tree labels each edge with a string of letters.
An edge can still be split in two if strings diverge part-way through it.
Since string comparison proceeds letter by letter anyway, matching along a multi-letter edge wastes no time.
Instead of O(n^2) nodes for a suffix trie, the suffix tree has at most 2n nodes.

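Suffix Tree
[Figure: the suffix trie for “reverse” compressed into a suffix tree; edges carry multi-letter labels such as "verse", "rse", "se", and "everse"]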

Using a suffix trie/tree
Find whether a string is a substring of a particular string:
  Search in the suffix tree; the string must match the beginning of some suffix.
  The search does not have to end at a terminating node. As long as all intermediate edges are there, the string is a substring.
Count matching substrings:
  Keep a count in each node of how many suffixes end at that node or below it.
Longest repeated substring:
  Find the deepest INTERNAL node of the suffix tree (not trie).
  Internal nodes must be a prefix of 2 or more suffixes.
Longest common substring (not subsequence) of two strings:
  Create a joint suffix tree.
  Mark each node as having descendants from one string, the other, or both.
  The longest common substring is the deepest node marked with both strings.
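The first three uses can be sketched on an uncompressed suffix trie, which is simpler to code than a suffix tree but needs O(n^2) nodes, so it only suits short strings. Everything here is an illustrative assumption rather than code from the slides: the names SuffixTrie, count_occurrences, is_substring, and longest_repeated are hypothetical, and each node stores how many suffixes pass through it. A node with a count of 2 or more plays the role of the "internal node" in the suffix tree description.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical names for illustration; not taken from the slides.
struct SuffixTrieNode {
    std::map<char, std::unique_ptr<SuffixTrieNode>> child;
    int count = 0;  // number of suffixes that pass through (or end at) this node
};

class SuffixTrie {
public:
    // Insert every suffix of text, one letter per edge: O(n^2) time and space.
    explicit SuffixTrie(const std::string& text) {
        for (std::size_t start = 0; start < text.size(); ++start) {
            SuffixTrieNode* node = &root_;
            for (std::size_t i = start; i < text.size(); ++i) {
                auto& next = node->child[text[i]];
                if (!next) next = std::make_unique<SuffixTrieNode>();
                node = next.get();
                ++node->count;
            }
        }
    }

    // Occurrences of a non-empty pattern = suffixes that begin with it.
    int count_occurrences(const std::string& pattern) const {
        if (pattern.empty()) return 0;  // convention chosen for this sketch
        const SuffixTrieNode* node = &root_;
        for (char c : pattern) {
            auto it = node->child.find(c);
            if (it == node->child.end()) return 0;
            node = it->second.get();
        }
        return node->count;
    }

    // The walk may stop anywhere; it does not need to reach the end of a suffix.
    bool is_substring(const std::string& pattern) const {
        return pattern.empty() || count_occurrences(pattern) > 0;
    }

    // Deepest node shared by at least two suffixes.
    std::string longest_repeated() const {
        std::string best, current;
        dfs(root_, current, best);
        return best;
    }

private:
    static void dfs(const SuffixTrieNode& node, std::string& current, std::string& best) {
        for (const auto& kv : node.child) {
            // Counts never increase going down, so this subtree can be skipped.
            if (kv.second->count < 2) continue;
            current.push_back(kv.first);
            if (current.size() > best.size()) best = current;
            dfs(*kv.second, current, best);
            current.pop_back();
        }
    }

    SuffixTrieNode root_;
};
```

For “reverse”, count_occurrences("e") is 3 and is_substring("ver") is true. For "banana", longest_repeated() returns "ana".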

Suffix Arrays
Suffix trees can be constructed in O(n), but the algorithm is complicated, so it is not good for fast, accurate coding.
Suffix arrays provide nearly as good operations and are much simpler to implement.
Idea: suffixes are originally numbered 0..n. Sort the indices so that the suffixes are in alphabetical order.

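Example: “reverse” (here the suffixes are numbered from 1, and 8 is the empty suffix)
Suffixes in original order: 1: reverse, 2: everse, 3: verse, 4: erse, 5: rse, 6: se, 7: e, 8: (empty)
Sorted alphabetically:
8: (empty)
7: e
4: erse
2: everse
1: reverse
5: rse
6: se
3: verse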

Implementing Suffix Array (1)
Let the original string be S.
Initialize SA[i] = i for all n suffixes.
Sort SA with sort(SA, SA+n, cmp), where the comparison is:
bool cmp(int a, int b) { return strcmp(S+a, S+b) < 0; }
That is, the comparison is true if the string starting at a is less than the string starting at b.
Unfortunately, though super-easy to code, this takes too long for long strings (> 1000 or so) because of the length of the string comparisons.
A strcmp string comparison is O(n), so the overall cost is O(n^2 lg n).
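A self-contained version of this simple approach might look as follows. It is a sketch under a few assumptions: the function name build_suffix_array_naive is hypothetical, it uses std::string::compare in place of strcmp on a C string, and it indexes the n non-empty suffixes from 0 (the empty suffix is left out).

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper name; same idea as sort(SA, SA+n, cmp) on the slide.
std::vector<int> build_suffix_array_naive(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // SA[i] = i
    // Each comparison may scan O(n) characters, so the sort is O(n^2 lg n) overall.
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}
```

For “reverse” this yields 6, 3, 1, 0, 4, 5, 2, which is the sorted order from the earlier example shifted to 0-based numbering and without the empty suffix.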

Implementing Suffix Array (2)
Performance can be improved by limiting the range of characters sorted on.
First sort by just the first letter.
Then sort by the first and second letters.
Then by the first through fourth, then the first through eighth, etc. (by powers of 2).
See the book for very efficient code for doing this, and for a more detailed explanation of the process.
This requires a stable sort (counting sort is OK!), which can be a linear-time sort.
The end result is O(n lg n), so long strings are possible.
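The doubling idea can be sketched without the counting sort. Note that this is a simplification, not the efficient version the slides refer to: it sorts pairs of ranks with std::sort in each round, which gives O(n lg^2 n) rather than O(n lg n). The function name build_suffix_array_doubling is hypothetical, and suffixes are indexed from 0.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper name. Simplified prefix doubling: O(n lg^2 n) because each
// round uses a comparison sort instead of the linear-time counting sort.
std::vector<int> build_suffix_array_doubling(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    // Round 0: rank each suffix by its first letter only.
    for (int i = 0; i < n; ++i) rank[i] = static_cast<unsigned char>(s[i]);

    for (int k = 1;; k *= 2) {
        // Key for the first 2k letters: (rank of first k, rank of next k).
        // A suffix shorter than k letters gets -1 so that it sorts first.
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + k < n ? rank[i + k] : -1);
        };
        std::sort(sa.begin(), sa.end(),
                  [&](int a, int b) { return key(a) < key(b); });

        // Re-rank: equal keys share a rank, larger keys get the next rank.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (key(sa[i - 1]) < key(sa[i]) ? 1 : 0);
        rank = tmp;

        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: fully sorted
    }
    return sa;
}
```

The loop stops as soon as every suffix has a distinct rank, which happens after at most about lg n rounds because the compared prefix length doubles each time.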

Using a Suffix Array
All of these uses require computing the suffix array first.
Finding occurrences of a substring:
  Binary search to find the suffix just before, and the suffix just after, the target string.
  All suffixes in between will be matches.
  O(m lg n) for finding a substring of length m.
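The two binary searches map directly onto std::lower_bound and std::upper_bound, comparing only the first m letters of each suffix against the pattern. The sketch below makes some assumptions: the names build_suffix_array_naive and find_occurrences are hypothetical, suffixes are indexed from 0, and the suffix array is built with the simple sort to keep the example short.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper: simple O(n^2 lg n) construction, enough for a demo.
std::vector<int> build_suffix_array_naive(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Returns the starting positions of every occurrence of pattern, in increasing order.
std::vector<int> find_occurrences(const std::string& s, const std::vector<int>& sa,
                                  const std::string& pattern) {
    const std::size_t m = pattern.size();
    // First suffix whose first m letters are not less than the pattern.
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&](int suffix, const std::string& p) { return s.compare(suffix, m, p) < 0; });
    // First suffix whose first m letters are greater than the pattern.
    auto hi = std::upper_bound(lo, sa.end(), pattern,
        [&](const std::string& p, int suffix) { return s.compare(suffix, m, p) > 0; });
    // Everything in [lo, hi) starts with the pattern.
    std::vector<int> positions(lo, hi);
    std::sort(positions.begin(), positions.end());
    return positions;
}
```

Each comparison looks at no more than m letters and each search takes O(lg n) steps, which matches the O(m lg n) bound. For “reverse”, searching for "e" returns positions 1, 3, and 6.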

Longest Common Prefix (LCP) between suffixes (in suffix array order):
  Binary search on letters: first letter, then second, etc.
  Each letter narrows the range to search in subsequently.
  Amortized analysis: O(n).
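One well-known way to get all adjacent LCP values in O(n) amortized time is Kasai's algorithm. It is shown here as a substitute technique and is not necessarily the method the slides have in mind. The function name build_lcp_kasai is hypothetical; lcp[i] holds the common prefix length of the suffixes at suffix array positions i-1 and i, with lcp[0] = 0.

```cpp
#include <string>
#include <vector>

// Hypothetical helper name. Kasai's algorithm: visit suffixes in text order and
// reuse the previous match length, which can shrink by at most 1 per step.
std::vector<int> build_lcp_kasai(const std::string& s, const std::vector<int>& sa) {
    const int n = static_cast<int>(s.size());
    std::vector<int> rank(n), lcp(n, 0);
    for (int i = 0; i < n; ++i) rank[sa[i]] = i;  // inverse of the suffix array

    int h = 0;  // current match length, carried between iterations
    for (int i = 0; i < n; ++i) {
        if (rank[i] == 0) {  // first suffix in sorted order has no predecessor
            h = 0;
            continue;
        }
        const int j = sa[rank[i] - 1];  // suffix just before suffix i in sorted order
        while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h;
        lcp[rank[i]] = h;
        if (h > 0) --h;  // dropping the first letter loses at most one matched letter
    }
    return lcp;
}
```

For "banana" with suffix array 5, 3, 1, 0, 4, 2 the result is 0, 1, 3, 0, 0, 2. The total work is linear because h increases at most 2n times over the whole run.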

Longest repeated substring:
  Once you have the LCP values, just find the maximum LCP value encountered.
  This takes no more time than computing the LCP.
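As a small end-to-end illustration, the sketch below builds a suffix array with the simple sort, measures the common prefix of each adjacent pair directly, and keeps the longest. The function name longest_repeated_substring is hypothetical, and both steps are the slow, easy-to-read variants, so this is only meant for short strings.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper name. Uses the simple suffix sort and a direct
// letter-by-letter LCP, so it favors clarity over speed.
std::string longest_repeated_substring(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });

    int best_len = 0, best_start = 0;
    for (int i = 1; i < n; ++i) {
        const int a = sa[i - 1], b = sa[i];
        int len = 0;  // LCP of two neighbors in suffix array order
        while (a + len < n && b + len < n && s[a + len] == s[b + len]) ++len;
        if (len > best_len) {
            best_len = len;
            best_start = b;
        }
    }
    return s.substr(best_start, best_len);  // empty if nothing repeats
}
```

For "banana" this returns "ana". For a string with no repeated letter, such as "abcd", it returns the empty string.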

Longest common substring between two strings:
  Concatenate one string onto the end of the other, with a unique terminating character between them.
  Then find the longest repeated substring, but require that the two suffixes (two adjacent entries in the suffix array) come from different strings.
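The concatenation trick can be sketched as follows. The assumptions are stated in the comments: the function name longest_common_substring is hypothetical, the separator '\x01' is assumed not to occur in either input, and the suffix array and LCP are again computed the simple, slow way.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper name. Assumes the separator character '\x01' appears in
// neither input, so no common prefix can run across the boundary.
std::string longest_common_substring(const std::string& first, const std::string& second) {
    const std::string s = first + '\x01' + second;
    const int n = static_cast<int>(s.size());
    const int boundary = static_cast<int>(first.size());  // index of the separator

    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });

    int best_len = 0, best_start = 0;
    for (int i = 1; i < n; ++i) {
        const int a = sa[i - 1], b = sa[i];
        if (a == boundary || b == boundary) continue;    // the separator's own suffix
        if ((a < boundary) == (b < boundary)) continue;  // both from the same string
        int len = 0;  // LCP of two adjacent suffixes from different strings
        while (a + len < n && b + len < n && s[a + len] == s[b + len]) ++len;
        if (len > best_len) {
            best_len = len;
            best_start = a;
        }
    }
    return s.substr(best_start, best_len);  // empty if the strings share no letter
}
```

For "reverse" and "everest" this returns "ever". Checking only adjacent entries is enough because all suffixes that share a given prefix sit in one contiguous block of the suffix array, so the best cross-string pair always shows up as neighbors somewhere in that block.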
