Two implementation issues: alphabet size and generalizing to multiple strings

1 Two implementation issues
Alphabet size
Generalizing to multiple strings

3 Effects of alphabet size on suffix trees
We have generally been assuming that the trees are built in such a way that
– from any node, we can find the edge for any specific character in Σ in constant time
– this requires an array of size |Σ| at each node
This takes Θ(m|Σ|) space.
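The array-per-node layout described on this slide can be sketched as follows. This is only an illustration: the names `ArrayNode`, `child_for`, and `index_of`, and the four-letter DNA alphabet, are assumptions made for the example and are not taken from the slides.

```python
class ArrayNode:
    """Suffix-tree node with one child slot per alphabet character (hypothetical layout)."""

    def __init__(self, alphabet_size):
        # One slot per character, used or not: this is where the
        # per-node cost proportional to the alphabet size comes from.
        self.children = [None] * alphabet_size


def child_for(node, ch, index_of):
    """Return the child reached by character ch, or None, with a single array access."""
    return node.children[index_of[ch]]


alphabet = "ACGT"  # assumed example alphabet
index_of = {c: i for i, c in enumerate(alphabet)}  # character -> slot number

root = ArrayNode(len(alphabet))
root.children[index_of["G"]] = ArrayNode(len(alphabet))

found = child_for(root, "G", index_of)    # the node stored under "G"
missing = child_for(root, "A", index_of)  # None: no edge starts with "A"
```

Every node pays for all four slots even though the root uses only one, which is the space cost the slide refers to.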

4 More compact representation
We can try to be more compact, taking only O(m) space.
– At each node, have pointers to only the edges that are needed
This slows down the search time. How much?
– typically to the minimum of O(log m) and O(log |Σ|) with a binary tree representation
This affects both suffix tree construction time and later searching time against the suffix tree.
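A minimal sketch of the compact idea, storing only the edges that exist. Note the substitution: it uses a sorted list with binary search instead of the binary tree mentioned on the slide. Lookup is still logarithmic in the number of children, but inserting into a Python list is linear. The class name `CompactNode` is hypothetical.

```python
import bisect


class CompactNode:
    """Node that stores only the children it actually has (hypothetical layout)."""

    def __init__(self):
        self.keys = []      # first characters of outgoing edges, kept sorted
        self.children = []  # children[i] is reached via keys[i]

    def add_child(self, ch, child):
        # Keep both lists aligned and sorted by edge character.
        i = bisect.bisect_left(self.keys, ch)
        self.keys.insert(i, ch)
        self.children.insert(i, child)

    def child_for(self, ch):
        # Binary search: O(log k) comparisons for a node with k children.
        i = bisect.bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.children[i]
        return None


root = CompactNode()
t_child, a_child, g_child = CompactNode(), CompactNode(), CompactNode()
root.add_child("T", t_child)
root.add_child("A", a_child)
root.add_child("G", g_child)
```

Space is now proportional to the number of edges actually present, at the price of a logarithmic search at each node.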

5 Other methods are truly alphabet independent
Z-computation, KMP, and BM all have running times and space requirements that are truly independent of the alphabet size.
This can make them superior to suffix tree approaches when |Σ| is large.
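To illustrate why Z-computation does not depend on the alphabet, here is a sketch of the standard linear-time Z algorithm. It only ever tests two symbols for equality, so it works unchanged on any sequence of comparable items. The function name `z_array`, the 0-based indexing, and the convention `z[0] = len(s)` are choices made for this example.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at i that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n  # convention chosen here; some texts leave this entry undefined
    l = r = 0  # [l, r) is the rightmost prefix match found so far
    for i in range(1, n):
        if i < r:
            # Reuse what is already known inside the current match window.
            z[i] = min(r - i, z[i - l])
        # Extend by explicit comparison; only equality of symbols is used.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


text_z = z_array("aabcaabxaaz")
# Symbols drawn from a huge integer alphabet are handled exactly the same way.
int_z = z_array([10**9, 7, 10**9, 7, 5])
```

No table indexed by character is built anywhere, so neither time nor space grows with the alphabet.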

6 Generalized suffix trees
Build a suffix tree for a set of strings S = {S_1, …, S_z}
Some issues
Nodes in the tree may correspond to substrings of potentially multiple strings S_i
– compact edge labels: need 3 fields (start position, stop position, string)
– leaf labels are now a set of pairs indicating starting position and string
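The two bookkeeping changes listed on this slide could be represented as below. This is a sketch only: the names `EdgeLabel` and `Leaf`, the 0-based inclusive positions, and the two example strings (which share a single end character `$`) are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EdgeLabel:
    """Compact edge label with the three fields named on the slide."""
    string_id: int  # which string the label is read from
    start: int      # first position (0-based, inclusive)
    stop: int       # last position (0-based, inclusive)

    def text(self, strings):
        return strings[self.string_id][self.start:self.stop + 1]


@dataclass
class Leaf:
    """Leaf labelled with every (string_id, start) pair whose suffix ends here."""
    suffixes: set = field(default_factory=set)


strings = ["xabxa$", "babxba$"]
label = EdgeLabel(string_id=1, start=1, stop=3)

leaf = Leaf()
leaf.suffixes.add((0, 4))  # suffix "a$" of strings[0]
leaf.suffixes.add((1, 5))  # suffix "a$" of strings[1]
```

Here the suffix `a$` occurs in both strings, so a single leaf carries two pairs. That is why a leaf label becomes a set rather than a single position.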

7 One way to compute
Use a different end character $_i for each string S_i
Concatenate all the strings together
Make suffix tree of concatenated string
Make artificial suffixes actual suffixes
– For any internal node v, L(v) must be a substring of an original string
Only the leaf edge labels can span two original strings because of the uniqueness of each $_i
– Postprocess and shorten leaf edge labels appropriately
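The concatenation and the shortening step can be illustrated without building a tree, by trimming each suffix of the concatenated string at the first end character it contains. This is a sketch under assumptions: the helper names are hypothetical, and the terminators are taken from Unicode's private-use area on the assumption that they never occur in the input strings.

```python
def concatenate(strings):
    """Join the strings, ending each one with its own terminator character."""
    # Assumption: private-use code points do not appear in the input strings.
    terminators = [chr(0xE000 + i) for i in range(len(strings))]
    text = "".join(s + t for s, t in zip(strings, terminators))
    return text, set(terminators)


def trim_suffix(text, start, terminators):
    """Cut the suffix text[start:] just after the first terminator it contains."""
    end = start
    while text[end] not in terminators:
        end += 1
    return text[start:end + 1]


text, terminators = concatenate(["ab", "b"])
trimmed = [trim_suffix(text, i, terminators) for i in range(len(text))]
```

After trimming, every suffix lies inside one original string and ends with that string's own terminator. Because the terminators are unique, the trimmed suffixes stay pairwise distinct, so shortening the leaf edge labels does not merge any leaves.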

8 Another way to compute
Build tree for S_1
Given tree for strings S_1 through S_i, add suffixes for S_{i+1} as follows:
– Search for S_{i+1} in the tree until a mismatch at position j+1 of S_{i+1}
– The existing tree implicitly has every suffix of S_{i+1}[1..j]
– Resume Ukkonen’s algorithm for S_{i+1} in phase j+1 from the point of the last match
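The first two steps of this slide can be checked on a small example. The sketch below deliberately uses a naive, uncompressed suffix trie built in quadratic time, not Ukkonen's algorithm, and it omits end characters. All function names are hypothetical.

```python
def build_suffix_trie(s):
    """Naive suffix trie as nested dicts (quadratic time; not Ukkonen's algorithm)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def matched_prefix_length(root, s):
    """Walk s from the root and return j, the number of characters matched before a mismatch."""
    node = root
    j = 0
    for ch in s:
        if ch not in node:
            break
        node = node[ch]
        j += 1
    return j


def contains(root, s):
    """True if s can be spelled out along a path from the root."""
    return matched_prefix_length(root, s) == len(s)


trie = build_suffix_trie("banana")          # plays the role of the tree for S_1
next_string = "anab"                        # plays the role of S_{i+1}
j = matched_prefix_length(trie, next_string)
prefix = next_string[:j]
already_present = all(contains(trie, prefix[k:]) for k in range(len(prefix)))
```

The walk stops after `ana`, and every suffix of that matched prefix is already a path in the trie. This is why the slide can skip the first j phases and resume at phase j+1.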

