Suffix Trees


1 Suffix Trees
String: any sequence of characters. Substring of string S: a string composed of characters i through j of S, i <= j. For S = cater, ate is a substring; car is not a substring. The empty string is a substring of S.

2 Subsequence
Subsequence of string S: a string composed of characters i1 < i2 < … < ik of S. For S = cater, ate is a subsequence, and so is car. The empty string is a subsequence.
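The two definitions can be checked directly with a small sketch. The function names `is_substring` and `is_subsequence` are illustrative choices, not names from the slides, and the code favors clarity over speed.

```python
def is_substring(s, p):
    """True if p equals characters i..j of s for some i <= j (or p is empty)."""
    return any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1))


def is_subsequence(s, p):
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator up to and including the match,
    # so each later character of p must be found further right in s.
    return all(ch in remaining for ch in p)
```

With S = cater, `ate` passes both checks, while `car` passes only the subsequence check.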

3 String/Pattern Matching
You are given a source string S. Answer queries of the form: is the string pi a substring of S? Knuth-Morris-Pratt (KMP) string matching takes O(|S| + |pi|) time per query, so O(n|S| + Σi |pi|) time for n queries. The suffix tree solution takes O(|S| + Σi |pi|) time for n queries. This is a very significant run-time reduction in situations where S is very long and the pis are very short.
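For comparison, a minimal sketch of KMP matching is given below. It is a standard textbook formulation rather than code from the slides; the name `kmp_find` and the 0-based return value are assumptions made for the example. Note that every query rescans S, which is where the n|S| term comes from.

```python
def kmp_find(s, p):
    """Return the 0-based index of the first occurrence of p in s, or -1."""
    if not p:
        return 0
    # Preprocess the pattern: fail[i] is the length of the longest proper
    # prefix of p[:i+1] that is also a suffix of it.
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    # Scan the source string once, never moving backwards in s.
    k = 0
    for i, ch in enumerate(s):
        while k and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return i - k + 1
    return -1
```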

4 String/Pattern Matching
KMP preprocesses the query string pi, whereas the suffix tree method preprocesses the source string S. An application of string matching: the genome project. There is a databank of strings (gene sequences), and the character set is A, C, G, T. Determine whether a "new" sequence is a substring of a databank sequence.

5 Definition Of Suffix Tree
Compressed trie with edge information. Keys are the nonempty suffixes of a given string S. The nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.

6 String Matching & Suffixes
pi is a substring of S iff pi is a prefix of some suffix of S. The nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r. Which of these are substrings of S? leep, eepe, pe, leap, peel
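The prefix-of-a-suffix characterization can be tried out with a brute-force sketch. The helper names `suffixes` and `is_substring_via_suffixes` are hypothetical; a suffix tree answers the same question without enumerating the suffixes one by one.

```python
def suffixes(s):
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(s, p):
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in suffixes(s))
```

For S = sleeper this reports leep, eepe, and pe as substrings, and leap and peel as non-substrings.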

7 Last Character Of S Repeats
When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix: the length-1 suffix (i.e., the last character) is a proper prefix of the suffix that begins at each of the other occurrences of this last character. For S = creeper, the suffixes are creeper, reeper, eeper, eper, per, er, r, and r is a proper prefix of reeper. To overcome this problem, use an end-of-string character #. For S = creeper#, the suffixes are creeper#, reeper#, eeper#, eper#, per#, er#, r#, #.
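The effect of the end-of-string character can be seen with a short brute-force check. The function name `suffixes_that_are_prefixes` is an illustrative choice, not something from the slides.

```python
def suffixes_that_are_prefixes(s):
    """Suffixes of s that are a proper prefix of some other suffix of s."""
    all_suffixes = [s[i:] for i in range(len(s))]
    return [a for a in all_suffixes
            if any(b != a and b.startswith(a) for b in all_suffixes)]
```

For creeper the only offending suffix is r; once # is appended, no suffix is a proper prefix of another, so every suffix can end at its own element node.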

8 Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#.] Edges are labeled with the branch character plus the skipped-over characters. When # is added, the char# of the root is always 1.

9 Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with edge labels abbb, b, #, abbbb#, b# and element nodes holding indexes into S.] Element nodes index into the string rather than keep complete suffixes. The index is to the first character of the suffix that would otherwise be in the element node.

10 Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with a char# and an index in each branch node.] Edge information (an edge is labeled with the branch character plus the skipped-over characters) is extracted using an index in the child branch/element node. The branch node index is the same as that in any descendant element node (i.e., the first character of any suffix in the subtree). Use the indicated character in the search string to reach a branch node; use the index in the reached branch node to figure out the skipped characters. You need to check characters from the previous node's char# to the current node's char# - 1. Note that by simply looking at the char#s on the path from the root, you can't figure out what the edge information is, because you can't tell where you are in the string S.

11 Suffix Tree Construction
See the Web write-up for the algorithm. Time complexity, with |S| = n and alphabet size r: O(nr) using array nodes, which is O(n) when r is a constant (or r <= c); O(n) expected time using a hash table. An O(n) time algorithm for large r is given in the reference cited in the Web write-up.
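The construction algorithm from the Web write-up is not reproduced in these slides. As a stand-in, the sketch below builds the same compressed trie by inserting the suffixes one at a time, which takes O(n^2) time in the worst case rather than the bounds quoted above. The names `Node`, `build_suffix_tree`, `label`, and `element_indexes` are assumptions made for the example, and edges are stored as (start, end) positions in S instead of the char#/index fields shown on the slides.

```python
class Node:
    def __init__(self, start, end, suffix=None):
        self.start, self.end = start, end  # edge label is s[start:end] (0-based)
        self.children = {}                 # first character of edge label -> child
        self.suffix = suffix               # 1-based suffix start; element nodes only


def build_suffix_tree(s):
    """Naive O(n^2) insertion of every suffix; s must end with a unique marker."""
    assert s and s.count(s[-1]) == 1, "last character must be a unique end marker"
    root, n = Node(0, 0), len(s)
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:              # no edge for this character: new element node
                node.children[s[j]] = Node(j, n, i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k, j = k + 1, j + 1
            if k == child.end:             # whole edge matched: keep descending
                node = child
                continue
            mid = Node(child.start, k)     # mismatch inside the edge: split it
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, i + 1)
            break
    return root


def label(s, node):
    """Edge label (branch character plus skipped characters) leading into node."""
    return s[node.start:node.end]


def element_indexes(root):
    """1-based suffix indexes stored in the element nodes, in no particular order."""
    out, stack = [], [root]
    while stack:
        v = stack.pop()
        if v.suffix is not None:
            out.append(v.suffix)
        stack.extend(v.children.values())
    return out
```

For S = abbbabbbb# the root of the resulting tree has edges labeled abbb, b, and #, and the node below abbb has edges abbbb# and b#, matching the edge labels in the figures.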

12 O(|pi|) Time Substring Matching
[Figure: suffix tree for S = abbbabbbb#, element nodes 1 through 10. Example search strings: babb, abbba, baba.]

13 Find All Occurrences Of pi
Search suffix tree for pi. Suppose the search for pi is successful. When search terminates at an element node, pi appears exactly once in the source string S.

14 Search Terminates At Element Node
[Figure: suffix tree for S = abbbabbbb#, element nodes 1 through 10. Example search string: abbbb#.]

15 Search Terminates At Branch Node
When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.
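The two cases can be sketched together: walk down the tree matching pi, then collect every element node at or below the node where the walk stops. This is an illustrative sketch with a naive O(n^2) builder standing in for the construction algorithm; `Node`, `build_suffix_tree`, `locate`, and `occurrences` are hypothetical names, and positions are reported 1-based as on the slides.

```python
class Node:
    def __init__(self, start, end, suffix=None):
        self.start, self.end = start, end  # edge label is s[start:end] (0-based)
        self.children = {}                 # first character of edge label -> child
        self.suffix = suffix               # 1-based suffix start; element nodes only


def build_suffix_tree(s):
    """Naive O(n^2) insertion of every suffix; s must end with a unique marker."""
    assert s and s.count(s[-1]) == 1
    root, n = Node(0, 0), len(s)
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, n, i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k, j = k + 1, j + 1
            if k == child.end:
                node = child
                continue
            mid = Node(child.start, k)     # split the edge at the mismatch
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, i + 1)
            break
    return root


def locate(root, s, p):
    """Node at which the search for p terminates, or None if p is not in s."""
    node, j = root, 0
    while j < len(p):
        child = node.children.get(p[j])
        if child is None:
            return None
        k = child.start
        while k < child.end and j < len(p):
            if s[k] != p[j]:
                return None
            k, j = k + 1, j + 1
        node = child                       # p may end in the middle of this edge
    return node


def occurrences(root, s, p):
    """1-based start positions of p in s, in no particular order."""
    node = locate(root, s, p)
    if node is None:
        return []
    out, stack = [], [node]
    while stack:                           # every element node below gives one occurrence
        v = stack.pop()
        if v.suffix is not None:
            out.append(v.suffix)
        stack.extend(v.children.values())
    return out
```

With S = abbbabbbb#, searching for abbbb# stops at an element node and yields the single position 5, while searching for ab stops at a branch node whose subtree yields positions 1 and 5.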

16 Search Terminates At Branch Node
[Figure: suffix tree for S = abbbabbbb#, element nodes 1 through 10. Example search string: ab.]

17 Find All Occurrences Of pi
To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment the suffix tree: link all element nodes into a chain in inorder (left to right), and have each branch node keep a pointer to the leftmost and the rightmost element node in its subtree.
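One way to sketch this augmentation is shown below. As simplifying assumptions, the chain is a Python list rather than linked nodes, the leftmost and rightmost pointers are positions `first` and `last` in that list, and children are ordered by their branch character. The builder is a naive O(n^2) stand-in, and all names (`Node`, `build_suffix_tree`, `augment`, `locate`, `all_occurrences`) are hypothetical.

```python
class Node:
    def __init__(self, start, end, suffix=None):
        self.start, self.end = start, end  # edge label is s[start:end] (0-based)
        self.children = {}                 # first character of edge label -> child
        self.suffix = suffix               # 1-based suffix start; element nodes only


def build_suffix_tree(s):
    """Naive O(n^2) insertion of every suffix; s must end with a unique marker."""
    assert s and s.count(s[-1]) == 1
    root, n = Node(0, 0), len(s)
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, n, i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k, j = k + 1, j + 1
            if k == child.end:
                node = child
                continue
            mid = Node(child.start, k)     # split the edge at the mismatch
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, i + 1)
            break
    return root


def augment(root):
    """Chain the element nodes left to right; give each node first/last positions."""
    chain = []

    def visit(v):
        if v.suffix is not None:
            v.first = v.last = len(chain)
            chain.append(v.suffix)
            return
        kids = [v.children[c] for c in sorted(v.children)]
        for kid in kids:
            visit(kid)
        v.first, v.last = kids[0].first, kids[-1].last

    visit(root)
    return chain


def locate(root, s, p):
    """Node at which the search for p terminates, or None if p is not in s."""
    node, j = root, 0
    while j < len(p):
        child = node.children.get(p[j])
        if child is None:
            return None
        k = child.start
        while k < child.end and j < len(p):
            if s[k] != p[j]:
                return None
            k, j = k + 1, j + 1
        node = child
    return node


def all_occurrences(root, chain, s, p):
    """O(|p|) search plus one step along the chain per occurrence."""
    node = locate(root, s, p)
    return [] if node is None else chain[node.first:node.last + 1]
```

After the search stops, no further tree traversal is needed: the occurrences are exactly the chain entries between the node's leftmost and rightmost element nodes.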

18 Augmented Suffix Tree
[Figure: augmented suffix tree for S = abbbabbbb# (character positions 1 through 10), with the element nodes chained together.]

19 Longest Repeating Substring
Find the longest substring of S that occurs at least m times in S, for a given m > 1. Label each branch node with the number of element nodes in its subtree. Find the branch node with label >= m and the maximum char# field.
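A sketch of this search follows. It counts element nodes bottom-up and keeps the deepest branch node whose count is at least m; the depth of a node (the length of the string spelled from the root) plays the role of the char# field minus one. The builder is a naive O(n^2) stand-in, and the names `Node`, `build_suffix_tree`, and `longest_repeat` are assumptions made for the example.

```python
class Node:
    def __init__(self, start, end, suffix=None):
        self.start, self.end = start, end  # edge label is s[start:end] (0-based)
        self.children = {}                 # first character of edge label -> child
        self.suffix = suffix               # 1-based suffix start; element nodes only


def build_suffix_tree(s):
    """Naive O(n^2) insertion of every suffix; s must end with a unique marker."""
    assert s and s.count(s[-1]) == 1
    root, n = Node(0, 0), len(s)
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, n, i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k, j = k + 1, j + 1
            if k == child.end:
                node = child
                continue
            mid = Node(child.start, k)     # split the edge at the mismatch
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, i + 1)
            break
    return root


def longest_repeat(s, m):
    """Longest substring occurring at least m times in s ('' if there is none)."""
    root = build_suffix_tree(s)
    best = (0, 0)                          # (length, 0-based start) of best answer

    def visit(v, depth):
        nonlocal best
        if v.suffix is not None:
            return 1                       # an element node counts itself
        count = sum(visit(c, depth + c.end - c.start)
                    for c in v.children.values())
        if count >= m and depth > best[0]:
            # The string spelled on the path to v is s[v.end - depth : v.end].
            best = (depth, v.end - depth)
        return count

    visit(root, 0)
    return s[best[1]:best[1] + best[0]]
```

For S = abbbabbbb#, m = 2 gives abbb (positions 1 and 5) and m = 5 gives bb (positions 2, 3, 6, 7, 8).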

20 Longest Repeating Substring
[Figure: suffix tree for S = abbbabbbb# with circled values 10, 2, 7, 5, 3 on the branch nodes; the answers for m = 2 and m = 5 are marked.] Circled values are the number of suffixes (excluding #) in the subtree.

21 Longest Common Substring
Given two strings S and T, find the longest common substring. For S = carport and T = airports, the longest common substring is rport and the longest common subsequence is arport. The longest common subsequence may be found in O(|S|*|T|) time using dynamic programming. The longest common substring may be found in O(|S|+|T|) time using a suffix tree.
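The dynamic programming approach for the longest common subsequence can be sketched as below. This is the standard textbook table, not code from the slides; the function name `longest_common_subsequence` is an illustrative choice.

```python
def longest_common_subsequence(s, t):
    """O(|s|*|t|) dynamic programming; returns one longest common subsequence."""
    m, n = len(s), len(t)
    # length[i][j] = length of an LCS of s[:i] and t[:j]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if s[i] == t[j]:
                length[i + 1][j + 1] = length[i][j] + 1
            else:
                length[i + 1][j + 1] = max(length[i][j + 1], length[i + 1][j])
    # Walk back from the bottom-right corner to recover the characters.
    out, i, j = [], m, n
    while i and j:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i, j = i - 1, j - 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```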

22 Longest Common Substring
Let $ be a new symbol. Construct the suffix tree for the string U = S$T#, e.g., U = carport$airports#. No repeating substring includes $. Find the longest repeating substring that occurs both to the left and to the right of $: find the branch node that has the maximum char# and has, in its subtree, at least one element node representing a suffix that begins in S as well as at least one representing a suffix that begins in T.
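The steps above can be sketched as follows. Each element node reports whether its suffix begins in S or in T, and the deepest branch node that sees both gives the answer. As assumptions for the example, the builder is a naive O(n^2) stand-in (so this sketch does not achieve the O(|S|+|T|) bound), neither input may contain $ or #, and the names `Node`, `build_suffix_tree`, and `longest_common_substring` are hypothetical.

```python
class Node:
    def __init__(self, start, end, suffix=None):
        self.start, self.end = start, end  # edge label is u[start:end] (0-based)
        self.children = {}                 # first character of edge label -> child
        self.suffix = suffix               # 1-based suffix start; element nodes only


def build_suffix_tree(s):
    """Naive O(n^2) insertion of every suffix; s must end with a unique marker."""
    assert s and s.count(s[-1]) == 1
    root, n = Node(0, 0), len(s)
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, n, i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k, j = k + 1, j + 1
            if k == child.end:
                node = child
                continue
            mid = Node(child.start, k)     # split the edge at the mismatch
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, i + 1)
            break
    return root


def longest_common_substring(a, b):
    """Longest string that is a substring of both a and b ('' if none)."""
    assert not set("$#") & set(a + b), "inputs must not contain $ or #"
    u = a + "$" + b + "#"
    root = build_suffix_tree(u)
    best = (0, 0)                          # (length, 0-based start in u)

    def visit(v, depth):
        nonlocal best
        if v.suffix is not None:
            # Does this suffix begin inside a, or inside b?
            return v.suffix <= len(a), len(a) + 1 < v.suffix < len(u)
        in_a = in_b = False
        for c in v.children.values():
            x, y = visit(c, depth + c.end - c.start)
            in_a, in_b = in_a or x, in_b or y
        if in_a and in_b and depth > best[0]:
            best = (depth, v.end - depth)  # path string is u[v.end - depth : v.end]
        return in_a, in_b

    visit(root, 0)
    return u[best[1]:best[1] + best[0]]
```

For S = carport and T = airports this returns rport.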

