Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …


1 Suffix Trees Suffix trees, linearized suffix trees, virtual suffix trees, suffix arrays, enhanced suffix arrays, suffix cactus, suffix vectors, …

2 Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i <= j, of S. S = cater => ate is a substring; car is not a substring. The empty string is a substring of S.

3 Subsequence Subsequence of string S … string composed of characters i_1 < i_2 < … < i_k of S. S = cater => ate is a subsequence; car is a subsequence. The empty string is a subsequence.
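
The two definitions can be checked directly on the slide's example S = cater. This is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(s: str, t: str) -> bool:
    """True if t equals characters i..j of s for some i <= j, or t is empty."""
    return any(s[i:i + len(t)] == t for i in range(len(s) - len(t) + 1))


def is_subsequence(s: str, t: str) -> bool:
    """True if t can be obtained from s by deleting zero or more characters."""
    it = iter(s)
    # `ch in it` consumes the iterator up to the first match, so the
    # characters of t must appear in s in the same left-to-right order.
    return all(ch in it for ch in t)


print(is_substring("cater", "ate"))    # True
print(is_substring("cater", "car"))    # False
print(is_subsequence("cater", "car"))  # True
print(is_substring("cater", ""))       # True
```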

4 String/Pattern Matching You are given a source string S. Answer queries of the form: is the string p_i a substring of S? Knuth-Morris-Pratt (KMP) string matching: O(|S| + |p_i|) time per query, so O(n|S| + Σ_i |p_i|) time for n queries. Suffix tree solution: O(|S| + Σ_i |p_i|) time for n queries.

5 String/Pattern Matching KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string S. An application of string matching: the genome project. The databank holds strings (gene sequences) over the character set ATGC. Determine whether a “new” sequence is a substring of a databank sequence.
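
To make the contrast concrete, here is a hedged sketch of KMP matching in which only the query string is preprocessed (into a failure table) and the source string is then scanned once per query. The names `failure_table` and `kmp_contains` are illustrative; the slides do not give code.

```python
def failure_table(p: str) -> list:
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s: str, p: str) -> bool:
    """Return True if p is a substring of s, using O(|s| + |p|) character comparisons."""
    if not p:
        return True
    fail = failure_table(p)  # preprocessing depends on the query only
    k = 0                    # number of characters of p currently matched
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False


print(kmp_contains("sleeper", "leep"))  # True
print(kmp_contains("sleeper", "leap"))  # False
```

Because the scan of S is repeated for every query, n queries cost O(n|S| + Σ_i |p_i|), which is the bound the suffix tree improves on.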

6 Definition Of Suffix Tree Compressed trie with edge information. Keys are the nonempty suffixes of a given string S. Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.

7 String Matching & Suffixes p_i is a substring of S iff p_i is a prefix of some suffix of S. Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r. Which of these are substrings of S? leep, eepe, pe, leap, peel
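
The prefix-of-a-suffix characterization can be applied literally to answer the question on this slide. The sketch below is deliberately naive (it enumerates every suffix); the helper names are illustrative only.

```python
def nonempty_suffixes(s: str) -> list:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(s: str, p: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_substring_via_suffixes("sleeper", p))
# leep True
# eepe True
# pe True
# leap False
# peel False
```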

8 Last Character Of S Repeats When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix. S = creeper: creeper, reeper, eeper, eper, per, er, r. When the last character of S appears more than once in S, use an end-of-string character # to overcome this problem. S = creeper#: creeper#, reeper#, eeper#, eper#, per#, er#, r#, #.
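
A quick check of this claim on the slide's example, as a sketch: list the suffixes that are proper prefixes of some other suffix, with and without the end marker. The function name is illustrative, and # is assumed not to occur elsewhere in S.

```python
def suffixes_that_are_proper_prefixes(s: str) -> list:
    """Suffixes of s that are a proper prefix of some other suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return [a for a in suffixes
            if any(b != a and b.startswith(a) for b in suffixes)]


print(suffixes_that_are_proper_prefixes("creeper"))   # ['r']
print(suffixes_that_are_proper_prefixes("creeper#"))  # []
```

Without the marker, the suffix r is a proper prefix of reeper, so it would end inside the trie rather than at its own element node; with the marker, every suffix ends at a distinct element node.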

9 Suffix Tree For S = abbbabbbb# [Figure: suffix tree for S = abbbabbbb#.]

10 Suffix Tree For S = abbbabbbb# [Figure: the same suffix tree, with the characters of S indexed 1 through 10.]

11 Suffix Tree For S = abbbabbbb# [Figure: the same suffix tree, with additional node fields shown.]

12 Suffix Tree Construction See Web write up for algorithm. Time complexity, with |S| = n and alphabet size = r: O(nr) using array nodes, which is O(n) for r a constant (or r <= c); O(n) expected time using a hash table; O(n) time algorithm for large r in the reference cited in the Web write up.
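
The linear-time construction is in the write up and is not reproduced here. As a stand-in, the sketch below builds an uncompressed suffix trie by inserting each suffix in turn, with hash-table (dict) children. This is a simpler, swapped-in technique that takes O(n^2) time and space, not the O(n) algorithm the slide refers to; the function names are illustrative.

```python
def build_suffix_trie(s: str) -> dict:
    """Insert every suffix of s into a trie of nested dicts (one dict per node)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            # Hash-table children: create the child on first use.
            node = node.setdefault(ch, {})
    return root


def contains(root: dict, p: str) -> bool:
    """Walk p from the root; p is a substring iff the walk never falls off the trie."""
    node = root
    for ch in p:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("abbbabbbb#")
print(contains(trie, "babb"))  # True
print(contains(trie, "baba"))  # False
```

Once the trie is built, each query costs O(|p|) regardless of |S|, which is the property the compressed suffix tree retains while using only O(n) nodes.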

13 Suffix Array Array that contains the start positions of the suffixes in lexicographic order. S = abbbabbbb#. Assume # < a < b. Then # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#, so SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]. LCP = length of the longest common prefix between adjacent entries of SA: LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -].
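
The SA and LCP values on this slide can be reproduced with a naive sketch that sorts the suffixes directly. This is not a linear-time construction, and the function names are illustrative. Python's default character order already gives # < a < b, and positions are 1-based to match the slide.

```python
def suffix_array(s: str) -> list:
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list) -> list:
    """lcp[k] = length of the longest common prefix of suffixes sa[k] and sa[k+1]."""
    def common_prefix_length(a: str, b: str) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    return [common_prefix_length(s[sa[k] - 1:], s[sa[k + 1] - 1:])
            for k in range(len(sa) - 1)]


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)                # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(lcp_array(s, sa))  # [0, 4, 0, 1, 1, 2, 2, 3, 3]
```

The slide's LCP list has a trailing "-" because the last SA entry has no successor; the sketch simply returns one value fewer than the length of SA.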

14 Suffix Array Less space than a suffix tree. Linear time construction. Can be used to solve several of the problems solved by a suffix tree with the same asymptotic complexity. Substring matching: binary search for p using SA, in O(|p| log |S|) time.
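
A sketch of the binary search, assuming the suffix array is built naively by sorting (the construction is not the point here). Each step compares p with the first |p| characters of one suffix, so there are O(log |S|) steps of O(|p|) work each. The names `suffix_array` and `sa_contains` are illustrative.

```python
def suffix_array(s: str) -> list:
    """1-based start positions of the suffixes of s in lexicographic order (naive)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_contains(s: str, sa: list, p: str) -> bool:
    """Binary search SA for the first suffix whose |p|-character prefix is >= p."""
    def prefix(k: int) -> str:
        start = sa[k] - 1
        return s[start:start + len(p)]

    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    # p occurs in s iff that suffix actually starts with p.
    return lo < len(sa) and prefix(lo) == p


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa_contains(s, sa, "babb"))  # True
print(sa_contains(s, sa, "baba"))  # False
```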

15 O(|p_i|) Time Substring Matching [Figure: example query patterns traced through the suffix tree for S = abbbabbbb#.]

16 Find All Occurrences Of p_i Search the suffix tree for p_i. Suppose the search for p_i is successful. When the search terminates at an element node, p_i appears exactly once in the source string S.

17 Search Terminates At Element Node [Figure: suffix tree for S = abbbabbbb#; the search for abbbb# terminates at an element node.]

18 Search Terminates At Branch Node When the search for p_i terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of p_i.

19 Search Terminates At Branch Node [Figure: suffix tree for S = abbbabbbb#; the search for ab terminates at a branch node.]

20 Find All Occurrences Of p_i To find all occurrences of p_i in time linear in the length of p_i and linear in the number of occurrences of p_i, augment the suffix tree: link all element nodes into a chain in inorder, and have each branch node keep a pointer to the leftmost and rightmost element node in its subtree.
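
The augmentation can be sketched on an uncompressed suffix trie (a simplification; the slides use a compressed suffix tree). The chain of element nodes is modeled as a Python list, and each node stores the indices of its leftmost and rightmost element node in that list. The class and function names are illustrative, and S is assumed to end with a unique end marker such as #.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.start = None   # 1-based suffix start; set only on element nodes
        self.left = None    # chain index of the leftmost element node below
        self.right = None   # chain index of the rightmost element node below


def build_augmented_trie(s: str):
    """Return (root, chain), where chain lists suffix starts in inorder."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.start = i + 1

    chain = []

    def dfs(node: Node) -> None:
        # Recursive for brevity; the depth equals the longest suffix length.
        node.left = len(chain)
        if node.start is not None:
            chain.append(node.start)
        for ch in sorted(node.children):
            dfs(node.children[ch])
        node.right = len(chain) - 1

    dfs(root)
    return root, chain


def occurrences(root: Node, chain: list, p: str) -> list:
    """Start positions of p: O(|p|) walk, then one slice of the chain."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []
    return chain[node.left:node.right + 1]


root, chain = build_augmented_trie("abbbabbbb#")
print(occurrences(root, chain, "ab"))      # [1, 5]
print(occurrences(root, chain, "abbbb#"))  # [5]
print(occurrences(root, chain, "b"))       # [9, 4, 8, 3, 7, 2, 6]
```

Visiting children in sorted order makes the chain coincide with the suffix array of S, which is why the occurrences of any pattern form one contiguous run.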

21 Augmented Suffix Tree [Figure: augmented suffix tree for S = abbbabbbb#, illustrated with the pattern b.]

22 Longest Repeating Substring Find the longest substring of S that occurs at least m times in S, for m > 1. Label each branch node with the number of element nodes in its subtree. Find the branch node with label >= m and maximum char# field.
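
The same question can be answered without building the tree, using a sorted list of suffixes: m suffixes that share a prefix are contiguous in sorted order, so the answer is the largest minimum over any m - 1 consecutive LCP values. This is a swapped-in suffix-array sketch, not the branch-node labeling described on the slide, and it is naive rather than linear time. The function name is illustrative.

```python
def longest_substring_occurring_at_least(s: str, m: int) -> str:
    """Longest substring of s with at least m (possibly overlapping) occurrences, m >= 2."""
    def common_prefix_length(a: str, b: str) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    sa = sorted(range(len(s)), key=lambda i: s[i:])  # 0-based suffix starts
    lcp = [common_prefix_length(s[sa[k]:], s[sa[k + 1]:])
           for k in range(len(sa) - 1)]

    best = ""
    # A window of m - 1 adjacent LCP values covers m adjacent suffixes.
    for k in range(len(lcp) - (m - 1) + 1):
        length = min(lcp[k:k + m - 1])
        if length > len(best):
            best = s[sa[k]:sa[k] + length]
    return best


print(longest_substring_occurring_at_least("abbbabbbb#", 2))  # abbb
print(longest_substring_occurring_at_least("abbbabbbb#", 5))  # bb
```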

23 Longest Repeating Substring [Figure: suffix tree for S = abbbabbbb# with labeled branch nodes, illustrating the cases m = 2 and m = 5.]

24 Longest Common Substring Given two strings S and T, find the longest common substring. S = carport, T = airports: longest common substring = rport; longest common subsequence = arport. The longest common subsequence may be found in O(|S|*|T|) time using dynamic programming. The longest common substring may be found in O(|S|+|T|) time using a suffix tree.
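
For reference, a sketch of the standard O(|S|*|T|) dynamic program for the longest common subsequence, run on the slide's example. The function name is illustrative.

```python
def longest_common_subsequence(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    # dp[i][j] = length of an LCS of s[i:] and t[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s[i] == t[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the top-left corner to recover one optimal answer.
    out = []
    i = j = 0
    while i < n and j < m:
        if s[i] == t[j]:
            out.append(s[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(longest_common_subsequence("carport", "airports"))  # arport
```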

25 Longest Common Substring Let $ be a new symbol. Construct the suffix tree for the string U = S$T#. U = carport$airports#. No repeating substring includes $. Find the longest repeating substring that occurs both to the left and to the right of $: find the branch node that has maximum char# and has in its subtree at least one element node representing a suffix that begins in S and at least one representing a suffix that begins in T.
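
The U = S$T# idea can be sketched with sorted suffixes instead of a suffix tree. Among suffixes of U that are adjacent in sorted order, take the pair with the longest common prefix in which one suffix begins in S and the other begins in T. This is a naive, swapped-in variant and not the O(|S|+|T|) suffix tree method; the function name is illustrative, and $ and # are assumed not to occur in S or T.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Longest common substring of s and t via sorted suffixes of U = s$t#."""
    u = s + "$" + t + "#"
    sa = sorted(range(len(u)), key=lambda i: u[i:])  # 0-based suffix starts

    def begins_in_s(i: int) -> bool:
        return i < len(s)

    best = ""
    for a, b in zip(sa, sa[1:]):
        if begins_in_s(a) == begins_in_s(b):
            continue  # both suffixes start on the same side of $
        # A common prefix can never contain $ or #, since each occurs once in U.
        n = 0
        while a + n < len(u) and b + n < len(u) and u[a + n] == u[b + n]:
            n += 1
        if n > len(best):
            best = u[a:a + n]
    return best


print(longest_common_substring("carport", "airports"))  # rport
```

Checking only adjacent pairs is sufficient for two strings: all suffixes that start with the answer are contiguous in sorted order, and that run must contain a neighboring pair with one suffix from each side.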

