Suffix Tree. 6 Mar. 2004. MinKoo Seo. Contents: Basic Text Searching; Introduction to Suffix Tree; Suffix Trees and Exact Matching; Longest Common Substring.


1 Suffix Tree 6 Mar. 2004 MinKoo Seo

2 Contents
• Basic Text Searching
• Introduction to Suffix Tree
• Suffix Trees and Exact Matching
• Longest Common Substring of Two Strings
• Circular String Linearization
• Suffix Array
• Where do large alphabet problems arise?
• Reference

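3 Basic Text Searching
• Naïve Method
• Time Complexity: O(mn) in the worst case, for a text of length m and a pattern of length n
• Example (T: Text, P: Pattern; ^ marks a matching character, * marks a mismatch)
   T: x a b x y a b x y a b x z
   P: a b x y a b x z
      *
        a b x y a b x z
        ^ ^ ^ ^ ^ ^ ^ *
          a b x y a b x z
          *
            a b x y a b x z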
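
The naïve method can be sketched in a few lines. This is a minimal illustration and not code from the slides; the function name `naive_search` is an assumption made here, and positions are reported 1-based to match the slides.

```python
def naive_search(text, pattern):
    """Return the 1-based start positions of pattern in text.

    Tries every shift of the pattern and compares character by
    character, so the worst case costs O(m * n) comparisons.
    """
    m, n = len(text), len(pattern)
    hits = []
    for shift in range(m - n + 1):
        k = 0
        # Compare left to right until a mismatch or the pattern ends.
        while k < n and text[shift + k] == pattern[k]:
            k += 1
        if k == n:
            hits.append(shift + 1)
    return hits


print(naive_search("xabxyabxyabxz", "abxyabxz"))  # [6]
```

On the slide's example, the only occurrence of P starts at position 6 of T.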

4 Basic Text Searching (cont)
• Skip shifts/compares (T: Text, P: Pattern)
• The first character of P (namely a) does not occur again in P until position 5 of P, so after the mismatch P can be shifted four places at once.
   T: x a b x y a b x y a b x z
   P: a b x y a b x z
      *
        a b x y a b x z
        ^ ^ ^ ^ ^ ^ ^ *
                a b x y a b x z
                ^ ^ ^ ^ ^ ^ ^ ^

5 Basic Text Searching (cont)
• Even smarter algorithm
• After the shift, the first three characters of P (abx) line up with characters of T that have already been matched as abx. Therefore, we do not compare these again and resume at position 4 of P.
   T: x a b x y a b x y a b x z
   P: a b x y a b x z
      *
        a b x y a b x z
        ^ ^ ^ ^ ^ ^ ^ *
                a b x y a b x z
                      ^ ^ ^ ^ ^
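
The slides do not name a specific algorithm for this idea. One well-known method that never re-compares text characters it has already matched is Knuth-Morris-Pratt, sketched below under that assumption. The names `failure_table` and `kmp_search` are chosen here for illustration, and the pattern is assumed to be non-empty.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the 1-based start positions of a non-empty pattern in text."""
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back to the longest prefix that still fits
        # instead of restarting the comparison from scratch.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)  # convert 0-based end to 1-based start
            k = fail[k - 1]
    return hits


print(failure_table("abxyabxz"))                 # [0, 0, 0, 0, 1, 2, 3, 0]
print(kmp_search("xabxyabxyabxz", "abxyabxz"))   # [6]
```

The table entry 3 for the prefix abxyabx is what encodes "abx was already matched": after the mismatch on z, matching resumes at position 4 of P.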

6 Introduction to Suffix Tree
• Definition: Suffix Tree
  • A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.
  • Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of S.
  • No two edges out of a node can have edge-labels beginning with the same character.
  • The key feature of the suffix tree is that for any leaf i, the concatenation of the edge labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m].

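7 Introduction to Suffix Tree (cont)
• Example
   S = x a b x a c
       1 2 3 4 5 6
   Suffix tree for S (edge labels, with leaf numbers in parentheses):
   root
     xa
       bxac  (leaf 1)
       c     (leaf 4)
     a
       bxac  (leaf 2)
       c     (leaf 5)
     bxac    (leaf 3)
     c       (leaf 6)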
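
A suffix tree satisfying the definition can be built by inserting the suffixes one at a time and splitting edges where needed. The sketch below is a quadratic-time illustration, not a linear-time construction; `Node`, `build_suffix_tree`, and `leaves` are names assumed here. It also assumes the last character of the string occurs nowhere else (as with c in xabxac), so that no suffix is a prefix of another.

```python
class Node:
    def __init__(self):
        self.children = {}  # first char of edge label -> (label, child)
        self.leaf = None    # 1-based suffix start for a leaf, else None


def build_suffix_tree(s):
    """Naive O(m^2) construction; not a linear-time algorithm."""
    if not s or s[-1] in s[:-1]:
        raise ValueError("last character must be unique (append e.g. '$')")
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.leaf = i + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0  # length of the common prefix of label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it with a new internal node.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaves(node, prefix=""):
    """Yield (leaf number, path label) for every leaf below node."""
    if node.leaf is not None:
        yield node.leaf, prefix
    for label, child in node.children.values():
        yield from leaves(child, prefix + label)


tree = build_suffix_tree("xabxac")
for number, suffix in sorted(leaves(tree)):
    print(number, suffix)  # leaf i spells out S[i..m]
```

For S = xabxac the root ends up with four outgoing edges, labeled xa, a, bxac and c.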

8 A Motivating Example: Suffix Trees and Exact Matching
• Exact Matching Algorithm (|T| = m, |P| = n)
  1. Build a suffix tree for the text T in O(m) time.
  2. Match the characters of P along the unique path in the suffix tree until either
     • P is exhausted: every leaf in the subtree below the point of the last match is numbered with a starting location of P in T; or
     • no more matches are possible: P does not appear anywhere in T.
• Time complexity
  • O(m) preprocessing time.
  • O(n + k) search time, where k is the number of occurrences of P in T.

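9 A Motivating Example: Suffix Trees and Exact Matching
• Example
   S = x a b x a c   /   P = x a
       1 2 3 4 5 6
   [Figure: the suffix tree for S. Matching P = xa ends at the node with path label xa; the leaves below it are 1 and 4, so P occurs in S at positions 1 and 4.]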
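
The matching step can be illustrated with a simplified structure. The sketch below uses a suffix trie (one character per edge, quadratic size) instead of a linear-time suffix tree, so it does not achieve the O(m) preprocessing bound; it only shows the "walk down, then collect the leaves below" idea. The names `build_suffix_trie` and `find_occurrences` are assumptions made here.

```python
def build_suffix_trie(text):
    """Nested dicts, one character per edge. The key None stores the
    1-based start position of the suffix that ends at that node."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1
    return root


def find_occurrences(trie, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []  # no more matches are possible
        node = node[ch]
    # The pattern is exhausted: every suffix ending below starts with it.
    hits, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key is None:
                hits.append(value)
            else:
                stack.append(value)
    return sorted(hits)


trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))  # [1, 4]
print(find_occurrences(trie, "xb"))  # []
```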

10 Example: Longest Common Substring of Two Strings
• Longest common substring problem
  • Find the longest substring common to two given strings S1 and S2.
  • For example, if S1 = superiorcalifornialives and S2 = sealiver, then the longest common substring of S1 and S2 is alive.
• Label of a path
  • The path label of a node is the label of the path from the root of T to that node, that is, the concatenation of the edge labels along that path.

11 Example: Longest Common Substring of Two Strings (cont)
• String-depth
  • For any node v in a suffix tree, the string-depth of v is the number of characters in v’s label.
• Example
   [Figure: the suffix tree for xabxac, with the internal node reached by the edge xa highlighted.]
   Path label: xa
   String-depth: 2

12 Example: Longest Common Substring of Two Strings (cont)
• Longest common substring problem using a generalized suffix tree
  • Each leaf of the tree represents either a suffix from one of the two strings or a suffix that occurs in both strings.
  • Mark each internal node v with a 1 (or 2) if there is a leaf in the subtree of v representing a suffix from S1 (or S2).
  • The path-label of any internal node marked both 1 and 2 is a substring common to both S1 and S2.
  • The algorithm has to find the node with the greatest string-depth that is marked both 1 and 2.
  • Construction can be done in linear time. Node marking and calculations of string-depth can be done by standard linear-time traversal methods.

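13 Example: Longest Common Substring of Two Strings (cont)
• Example
   S1 = x a b x a c   /   S2 = a x a y z
        1 2 3 4 5 6           1 2 3 4 5
   Generalized suffix tree (a leaf labeled i,j is the suffix of Si starting at position j):
   root
     xa           (marked 1 and 2)
       bxac  (leaf 1,1)
       c     (leaf 1,4)
       yz    (leaf 2,2)
     a            (marked 1 and 2)
       bxac  (leaf 1,2)
       c     (leaf 1,5)
       xayz  (leaf 2,1)
       yz    (leaf 2,3)
     bxac    (leaf 1,3)
     c       (leaf 1,6)
     yz      (leaf 2,4)
     z       (leaf 2,5)
   The deepest node marked both 1 and 2 has path label xa, so the longest common substring is xa.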
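
The marking idea can be tried out on a simplified structure. The sketch below uses a generalized suffix trie (one character per edge) and marks nodes while inserting instead of in a separate traversal, so it runs in quadratic rather than linear time. The function name `longest_common_substring` and the node layout are assumptions made here.

```python
def longest_common_substring(s1, s2):
    """Longest substring common to s1 and s2 via a marked suffix trie."""
    def new_node():
        return {"kids": {}, "marks": set()}

    root = new_node()
    for mark, s in ((1, s1), (2, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                if ch not in node["kids"]:
                    node["kids"][ch] = new_node()
                node = node["kids"][ch]
                # Every node on the path has a suffix of this string below it.
                node["marks"].add(mark)

    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["kids"].items():
            # Only descend through nodes marked with both strings.
            if child["marks"] == {1, 2}:
                path = label + ch
                if len(path) > len(best):
                    best = path
                stack.append((child, path))
    return best


print(longest_common_substring("superiorcalifornialives", "sealiver"))  # alive
print(longest_common_substring("xabxac", "axayz"))                      # xa
```

When several common substrings share the maximum length, this sketch returns one of them; which one depends on traversal order.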

14 Example: Circular String Linearization
• Circular String
  • A circular string of length n is a string in which character n is considered to precede character 1.
• Example
   [Figure: a circular string drawn with the characters a a a x a c d y x a z c]

15 Example: Circular String Linearization (cont)
• Lexically smaller than
  • Given an ordering of the characters in the alphabet, a string S1 is lexically (or lexicographically) smaller than a string S2 if S1 would appear before S2 in a normal dictionary ordering of the two strings.
• Circular String Linearization Problem
  • Choose a place to cut S so that the resulting linear string is the lexically smallest of all the n possible linear strings created by cutting S.

16 Example: Circular String Linearization (cont)
• Example
   [Figure: a circular string and a cut of it, drawn with the characters a a a x a c d y x a z c]
• Motivation of Circular String Linearization
  • This problem arises in chemical databases for circular molecules. To allow faster lookup and comparisons of molecules, one wants to store each circular string by a canonical linear string. A natural choice for the canonical linear string is the one that is lexically least.

17 Example: Circular String Linearization (cont)
• Solution via suffix trees
  1. Arbitrarily cut the circular string S, giving a linear string L.
  2. Double L, creating the string LL.
  3. Affix the terminal symbol $ at the end of LL, but interpret it to be lexically greater than any other character in the alphabet, and build the suffix tree T for LL$.
  4. Traverse tree T with the rule that, at every node, the traversal follows the edge whose first character is lexically smallest over all first characters on edges out of the node.
  5. This traversal continues until the traversed path has string-depth n. The label of the traversed path is then the lexically smallest linearization of S.
• This method runs in linear time and is therefore optimal.

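18 Example: Circular String Linearization (cont)
• Example
   Circular string (as drawn): b a c d
   L = c d b a
   LL$ = c d b a c d b a $
         1 2 3 4 5 6 7 8 9
   Suffix tree for LL$ (edge labels, with leaf numbers in parentheses):
   root
     a
       cdba$  (leaf 4)
       $      (leaf 8)
     ba
       cdba$  (leaf 3)
       $      (leaf 7)
     cdba
       cdba$  (leaf 1)
       $      (leaf 5)
     dba
       cdba$  (leaf 2)
       $      (leaf 6)
     $        (leaf 9)
   Following the lexically smallest edge until string-depth 4 spells out a c d b.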
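
The traversal rule can be tried out on a simplified structure. The sketch below builds a suffix trie of LL$ (one character per edge, quadratic size) instead of a linear-time suffix tree, so it illustrates the rule but not the linear running time. The function name `linearize` is an assumption made here, and the input is assumed not to contain the character $.

```python
def linearize(circular):
    """Lexically smallest linearization of a circular string, given as
    any one of its cuts. '$' is treated as greater than all characters."""
    n = len(circular)
    doubled = circular + circular + "$"

    # Suffix trie of LL$: nested dicts, one character per edge.
    root = {}
    for i in range(len(doubled)):
        node = root
        for ch in doubled[i:]:
            node = node.setdefault(ch, {})

    def rank(ch):
        # False sorts before True, so '$' ranks after every other character.
        return (ch == "$", ch)

    # Always follow the smallest first character until string-depth n.
    node, out = root, []
    for _ in range(n):
        ch = min(node, key=rank)
        out.append(ch)
        node = node[ch]
    return "".join(out)


print(linearize("cdba"))  # acdb
```

Any cut of the same circular string gives the same result, for example `linearize("bacd")` also returns acdb.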

19 Suffix Array
• When alphabet size is included in the time and space bounds, a suffix tree may require too much space to be practical in some applications.
• Suffix Array
  • Very space efficient and yet can be used to solve the exact matching problem or the substring problem almost as efficiently as with a suffix tree.
  • Likely to be an important contribution to certain string problems in computational biology, where the alphabet can be very large.

20 Suffix Array (cont)
• Suffix Array
  • Given an m-character string T, a suffix array for T, called Pos, is an array of integers in the range 1 to m, specifying the lexicographic order of the m suffixes of string T.
  • That is, the suffix starting at position Pos(1) of T is the lexically smallest suffix, and in general suffix Pos(i) of T is lexically smaller than suffix Pos(i+1).

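21 Suffix Array (cont)
• Example
  • If T is mississippi, the suffixes in lexicographic order are:
     11: i
      8: ippi
      5: issippi
      2: ississippi
      1: mississippi
     10: pi
      9: ppi
      7: sippi
      4: sissippi
      6: ssippi
      3: ssissippi
  • Hence Pos = 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3.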
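
The example can be reproduced by sorting the suffix start positions directly. This is a simple sketch and not an efficient construction; `suffix_array` and `search` are names assumed here. The search uses two binary searches over Pos, comparing up to |P| characters at each step.

```python
def suffix_array(t):
    """Pos as a list of 1-based suffix starts in lexicographic order.
    Plain sorting of suffixes; fine for illustration, slow for long texts."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def search(t, pos, p):
    """Sorted 1-based occurrences of p in t, found by binary search on pos."""
    def prefix(rank):
        start = pos[rank] - 1
        return t[start:start + len(p)]

    # First suffix whose length-|p| prefix is >= p.
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose length-|p| prefix is > p.
    hi = len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])


T = "mississippi"
Pos = suffix_array(T)
print(Pos)                    # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(T, Pos, "ssi"))  # [3, 6]
```

The suffixes that start with P occupy one contiguous block of Pos, which is why two boundary searches are enough.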

22 Suffix Array (cont)
• Notice that the suffix array holds only integers and hence contains no information about the alphabet used in string T.
• Therefore, the space required by suffix arrays is modest.

23 Where do large alphabet problems arise?
• Motivation of suffix arrays
  • Suffix arrays come from problems that arise in using suffix trees when the underlying alphabet is large.
• Large alphabet problems
  • Natural languages, such as Chinese
  • Strings that contain numbers, each of which is treated as a character

24 Where do large alphabet problems arise? (cont)
• Map matching problem
  • A restriction enzyme map for a single enzyme specifies the locations in a DNA string where copies of a certain substring occur.
  • The restriction enzyme map for that single enzyme is represented as a string consisting of a sequence of integers specifying the distances between successive enzyme sites. More generally, a map may display the sites of many different patterns of interest.
  • The alphabet is huge because the range of integers is huge, and since distances are often known with high precision, the numbers are not rounded off.
  • In addition, most DNA substrings that are studied are not fully sequenced; maps are easier and cheaper than sequences.
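
To make the "integers as characters" idea concrete, the sketch below turns a list of site positions into the string of distances and builds a suffix array over it. The site positions are made up purely for illustration, and `distance_string` and `suffix_array` are names assumed here.

```python
def distance_string(sites):
    """Distances between successive sites; each integer acts as one
    character of a very large alphabet."""
    return [b - a for a, b in zip(sites, sites[1:])]


def suffix_array(seq):
    """1-based suffix starts in lexicographic order. Python compares
    lists element by element, so integer 'characters' work unchanged."""
    return sorted(range(1, len(seq) + 1), key=lambda i: seq[i - 1:])


# Hypothetical site positions, not real map data.
sites = [120, 4500, 4730, 9100, 9330, 13700]
dist = distance_string(sites)
print(dist)                # [4380, 230, 4370, 230, 4370]
print(suffix_array(dist))  # [4, 2, 5, 3, 1]
```

The suffix array stores only positions, so its size does not depend on how many distinct distances occur.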

25 Reference
• Dan Gusfield, "Algorithms on Strings, Trees, and Sequences", Cambridge University Press

