Suffix Trees

1 Suffix Trees
String: any sequence of characters.
Substring of string S: a string composed of characters i through j of S, with i <= j.
  - S = cater => ate is a substring.
  - car is not a substring.
  - The empty string is a substring of S.

2 Subsequence
Subsequence of string S: a string composed of characters i_1 < i_2 < … < i_k of S.
  - S = cater => ate is a subsequence.
  - car is a subsequence.
  - The empty string is a subsequence.
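
The two definitions above can be checked mechanically. A minimal Python sketch; the function names are illustrative and not from the slides:

```python
def is_substring(p, s):
    """True if p occurs in s as a contiguous block (characters i through j)."""
    return p in s  # `in` on Python strings tests contiguous containment


def is_subsequence(p, s):
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    it = iter(s)
    # `ch in it` consumes the iterator up to the first match, which enforces
    # the increasing index order i_1 < i_2 < ... < i_k.
    return all(ch in it for ch in p)


print(is_substring("ate", "cater"), is_substring("car", "cater"))      # True False
print(is_subsequence("ate", "cater"), is_subsequence("car", "cater"))  # True True
```

Both functions return True for the empty string, matching the last bullet of each slide.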

3 String/Pattern Matching
You are given a source string S. Answer queries of the form: is the string p_i a substring of S?
Knuth-Morris-Pratt (KMP) string matching.
  - O(|S| + |p_i|) time per query.
  - O(n|S| + Σ_i |p_i|) time for n queries.
Suffix tree solution.
  - O(|S| + Σ_i |p_i|) time for n queries.

4 String/Pattern Matching
KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string S.
An application of string matching.
  - Genome project.
  - Databank of strings (gene sequences).
  - Character set is ATGC.
  - Determine if a "new" sequence is a substring of a databank sequence.
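
To make the KMP side of the comparison concrete, here is a minimal sketch of the textbook KMP matcher. The names failure_table and kmp_contains are illustrative. Note that the only preprocessing is on the pattern p, so every new query has to scan S again:

```python
def failure_table(p):
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s, p):
    """Return True if p is a substring of s, in O(|s| + |p|) time."""
    if not p:
        return True  # the empty string is a substring of every string
    fail = failure_table(p)  # preprocessing depends on p only
    k = 0  # number of pattern characters matched so far
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
            if k == len(p):
                return True
    return False


print(kmp_contains("sleeper", "leep"), kmp_contains("sleeper", "leap"))  # True False
```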

5 Definition Of Suffix Tree
Compressed trie with edge information.
Keys are the nonempty suffixes of a given string S.
Nonempty suffixes of S = sleeper are:
  - sleeper
  - leeper
  - eeper
  - eper
  - per, er, and r.

6 String Matching & Suffixes
p_i is a substring of S iff p_i is a prefix of some suffix of S.
Nonempty suffixes of S = sleeper are:
  - sleeper
  - leeper
  - eeper
  - eper
  - per, er, and r.
Which of these are substrings of S?
  - leep, eepe, pe, leap, peel
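
The prefix-of-a-suffix characterization can be tried directly on the question above. This is a brute-force sketch for illustration only, with hypothetical helper names; it is not how a suffix tree performs the test:

```python
def nonempty_suffixes(s):
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(p, s):
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_substring_via_suffixes(p, "sleeper"))
# leep True
# eepe True
# pe True
# leap False
# peel False
```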

7 Last Character Of S Repeats
When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.
S = creeper
  - creeper, reeper, eeper, eper, per, er, r
  - Here the suffix r is a proper prefix of the suffix reeper.
When the last character of S appears more than once in S, use an end-of-string character # to overcome this problem.
S = creeper#
  - creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
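
A small check of the claim above, assuming # does not occur in S. The helper name prefix_suffix_pairs is illustrative:

```python
def prefix_suffix_pairs(s):
    """All pairs (a, b) of distinct suffixes of s where a is a proper prefix of b."""
    sufs = [s[i:] for i in range(len(s))]
    return [(a, b) for a in sufs for b in sufs if a != b and b.startswith(a)]


print(prefix_suffix_pairs("creeper"))   # [('r', 'reeper')]
print(prefix_suffix_pairs("creeper#"))  # []
```

With the terminator appended, no suffix is a prefix of another, so every suffix can end at its own element node in the trie.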

8 Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#.]

9 Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S numbered 1 through 10.]

10 Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S numbered 1 through 10 and numeric fields shown on the nodes.]

11 Suffix Tree Construction
See Web write-up for algorithm.
Time complexity
  - |S| = n, alphabet size = r.
  - O(nr) using array nodes.
  - This is O(n) for r a constant (or r <= c).
  - O(n) expected time using a hash table.
  - O(n) time algorithm for large r in reference cited in Web write-up.

12 O(|p_i|) Time Substring Matching
[Figure: substring search in the suffix tree for S = abbbabbbb#.]
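
A rough sketch of construction and O(|p_i|) matching. This is NOT the algorithm from the Web write-up: it is an assumption-laden simplification that builds an uncompressed suffix trie (one character per edge) by inserting each suffix in turn, which takes O(n^2) time and space. Nodes keep their children in a dict, in the spirit of the hash table option above. Node, build_suffix_trie and contains are illustrative names:

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node (hash table of children)
        self.start = None   # for element nodes: 1-based start position of the suffix


def build_suffix_trie(s):
    """Naive O(n^2) uncompressed suffix trie; s must end in a unique terminator."""
    if not s or s.count(s[-1]) != 1:
        raise ValueError("last character of s must occur exactly once, e.g. '#'")
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.start = i + 1  # this node represents the suffix starting at position i + 1
    return root


def contains(root, p):
    """Follow p from the root; O(|p|) expected time, independent of |S|."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return False
    return True


root = build_suffix_trie("abbbabbbb#")
print(contains(root, "babb"), contains(root, "bbbbb"))  # True False
```

Once the structure is built, each query costs time proportional to the pattern only, which is where the O(|S| + Σ_i |p_i|) total for n queries comes from when construction itself is linear.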

13 Find All Occurrences Of p_i
Search the suffix tree for p_i.
Suppose the search for p_i is successful.
When the search terminates at an element node, p_i appears exactly once in the source string S.

14 Search Terminates At Element Node
[Figure: search in the suffix tree for S = abbbabbbb# terminating at the element node for the suffix abbbb#.]

15 Search Terminates At Branch Node
When the search for p_i terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of p_i.

16 Search Terminates At Branch Node
[Figure: search for ab in the suffix tree for S = abbbabbbb# terminating at a branch node.]
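
The rule on the two preceding slides (one occurrence per element node below the point where the search stops) can be sketched as follows. As a simplifying assumption this uses a naive uncompressed suffix trie rather than a compressed suffix tree, and it walks the whole subtree instead of using any augmentation; all names are illustrative:

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node
        self.start = None   # 1-based suffix start position (element nodes only)


def build_suffix_trie(s):
    """Naive O(n^2) suffix trie; assumes s ends in a unique terminator such as '#'."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1
    return root


def occurrences(root, p):
    """Sorted 1-based start positions of every occurrence of p."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []  # unsuccessful search: p does not occur
    starts, stack = [], [node]
    while stack:  # visit every node in the subtree below the match point
        cur = stack.pop()
        if cur.start is not None:
            starts.append(cur.start)
        stack.extend(cur.children.values())
    return sorted(starts)


root = build_suffix_trie("abbbabbbb#")
print(occurrences(root, "ab"))     # [1, 5]
print(occurrences(root, "abbbb"))  # [5]
```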

17 Find All Occurrences Of p_i
To find all occurrences of p_i in time linear in the length of p_i and linear in the number of occurrences of p_i, augment the suffix tree:
  - Link all element nodes into a chain in inorder.
  - Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.

18 Augmented Suffix Tree
[Figure: augmented suffix tree for S = abbbabbbb#.]

19 Longest Repeating Substring
Find the longest substring of S that occurs at least m times in S, where m > 1.
Label branch nodes with the number of element nodes in their subtree.
Find the branch node with label >= m and max char# field.

20 Longest Repeating Substring
[Figure: suffix tree for S = abbbabbbb# with branch nodes labeled by the number of element nodes in their subtrees; examples for m = 2 and m = 5.]
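
A sketch of the labeling idea, again on a naive uncompressed suffix trie rather than a compressed suffix tree (an assumption made to keep the code short; it costs O(n^2) time and space). Each node's count plays the role of the label, and its depth plays the role of char#. The names Node and longest_repeat are illustrative:

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node
        self.count = 0      # number of suffixes through this node = element nodes below it


def longest_repeat(s, m):
    """Longest substring of s occurring at least m times (m > 1).

    Assumes s ends in a unique terminator such as '#'.
    """
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1  # one more suffix passes through this node
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node.children.items():
            if child.count >= m:  # only descend while the label stays >= m
                stack.append((child, path + ch))
    return best


print(longest_repeat("abbbabbbb#", 2))  # abbb
print(longest_repeat("abbbabbbb#", 5))  # bb
```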

21 Longest Common Substring
Given two strings S and T, find the longest common substring.
S = carport, T = airports
  - Longest common substring = rport
  - Longest common subsequence = arport
Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.
Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
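
For the subsequence half of the comparison, a minimal sketch of the standard O(|S|*|T|) dynamic program. The function name and table layout are illustrative choices, not taken from the slides:

```python
def longest_common_subsequence(s, t):
    """One longest common subsequence of s and t, via an O(|s|*|t|) table."""
    n, m = len(s), len(t)
    # dp[i][j] = length of the longest common subsequence of s[i:] and t[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s[i] == t[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the top-left corner to recover one optimal answer.
    out = []
    i = j = 0
    while i < n and j < m:
        if s[i] == t[j]:
            out.append(s[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(longest_common_subsequence("carport", "airports"))  # arport
```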

22 Longest Common Substring
Let $ be a new symbol.
Construct the suffix tree for the string U = S$T#.
  - U = carport$airports#
  - No repeating substring includes $.
  - Find the longest repeating substring that occurs both to the left and to the right of $.
Find the branch node that has max char# and has in its subtree at least one element node representing a suffix that begins in S and at least one representing a suffix that begins in T.
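
The U = S$T# construction can be sketched as below. As a simplifying assumption this builds a naive uncompressed suffix trie of U, so it runs in O((|S|+|T|)^2) time rather than the O(|S|+|T|) achievable with a real suffix tree. Each node records whether some suffix through it begins in S and whether some suffix begins in T; all names are illustrative:

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> child Node
        self.in_s = False   # some suffix through this node begins in the S part of U
        self.in_t = False   # some suffix through this node begins in the T part of U


def longest_common_substring(s, t):
    """A longest common substring of s and t; assumes neither contains '$' or '#'."""
    if "$" in s + t or "#" in s + t:
        raise ValueError("'$' and '#' must not occur in the inputs")
    u = s + "$" + t + "#"
    root = Node()
    for i in range(len(u)):
        begins_in_s = i <= len(s)  # the suffix starting at '$' is counted with S
        node = root
        for ch in u[i:]:
            node = node.children.setdefault(ch, Node())
            if begins_in_s:
                node.in_s = True
            else:
                node.in_t = True
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node.children.items():
            # Paths containing '$' are reached only by suffixes beginning in S,
            # so they are never marked on both sides.
            if child.in_s and child.in_t:
                stack.append((child, path + ch))
    return best


print(longest_common_substring("carport", "airports"))  # rport
```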

