1 Suffix Trees © Jeff Parker, 2009. 2 Outline An introduction to the Suffix Tree Some sample applications How to build a Suffix Tree efficiently.


1 Suffix Trees © Jeff Parker, 2009

2 Outline
An introduction to the Suffix Tree
Some sample applications
How to build a Suffix Tree efficiently

3 Problems
We have a corpus of information: genes, proteins.
We want to see what two sequences have in common.
We want to be able to find matches for a gene or protein.
Model this as a search for a pattern in a text.
The problem is hard because the strings are very long and the set of possible matches is large.
Today we will focus on exact matches.

4 Pattern Matching
The basis for the simplest (exact) pattern match follows.
Algorithm:
Line up text and pattern
Compare the two
If they match
    Report the position of the match
Else
    Slide the pattern to the right and try again
[Figure: the pattern aligned under the text]

5 Compare pattern at this position
// Does the pattern match the text at this position?
boolean compare(String text, int pos, String pattern) {
    for (int i = 0; i < pattern.length(); i++)
        if (text.charAt(pos + i) != pattern.charAt(i))
            return false;
    return true;
}

6 Simple Pattern Match
// Where is pattern pat in string text?
int findMatch(String text, String pat) {
    int pos = 0;
    while (pos <= text.length() - pat.length()) {
        if (compare(text, pos, pat))
            return pos;
        pos++; // Slide pattern right one space
    }
    return -1; // No match anywhere in the text
}

7 Analysis
For a pattern of length N and a text of length M:
This algorithm behaves well in practice: O(N + M)
The worst case is bad: O(NM)
We can do better if we preprocess.
Preprocess the pattern: Boyer-Moore, Knuth-Morris-Pratt
Preprocess the text: Suffix Tree
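To make the gap between the typical and the worst case concrete, here is a small sketch that counts character comparisons made by the slide-and-compare search. The class and method names (NaiveMatchCost, countComparisons, repeat) are illustrative and not from the slides.

```java
import java.util.Arrays;

class NaiveMatchCost {
    // Counts the character comparisons made by the slide-and-compare search,
    // stopping at the first match (or when the pattern no longer fits).
    static long countComparisons(String text, String pat) {
        long comparisons = 0;
        for (int pos = 0; pos + pat.length() <= text.length(); pos++) {
            int i = 0;
            while (i < pat.length()) {
                comparisons++;
                if (text.charAt(pos + i) != pat.charAt(i)) break;
                i++;
            }
            if (i == pat.length()) break; // whole pattern matched
        }
        return comparisons;
    }

    // Helper: a string made of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Friendly case: every alignment fails on its first character.
        System.out.println(countComparisons("abcdefgh", "xyz"));
        // Bad case: every alignment matches 9 characters before failing.
        System.out.println(countComparisons(repeat('a', 1000), repeat('a', 9) + "b"));
    }
}
```

In the second call each of the 991 alignments costs a full pattern length of comparisons, which is the O(NM) behavior the slide warns about.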

8 O(|pattern|) Pattern Matching
Rather than view the problem as moving the pattern, rephrase

9 Faster Pattern Matching
Is our pattern the prefix of a suffix of the text string S? Take all suffixes…

10 Faster Pattern Matching
Take all suffixes and slide left

11 Faster Pattern Matching
Want to find a string that has the pattern as a prefix

12 Sort suffixes

13 Build Trie
Allows O(N) search for a pattern of length N

14 Suffix Trie
Multi-way tree
Each branch is labeled with a char
If the trie is ready, a match takes O(|pattern|) time
Example: text S is ababc
s1 = ababc
s2 = babc
s3 = abc
s4 = bc
s5 = c
[Figure: suffix trie for ababc, with leaves numbered 1 to 5 by suffix]

15 Suffix Trie
The suffix trie takes O(|S|^2) space
Each step of the search for a match takes constant time
If no branch matches the char, we fail
A leaf holds the name of its suffix
We may have multiple matches: the string ab occurs twice, as a prefix of s1 and of s3
[Figure: suffix trie for ababc, as on the previous slide]
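A minimal sketch of the suffix trie described above, assuming the quadratic build is acceptable for a short text. The names (SuffixTrie, occurrences, suffixId) are illustrative. Because no terminator is appended here, a suffix may end at an interior node, so the suffix name is stored at whichever node the suffix ends on rather than only at leaves.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class SuffixTrie {
    private static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        int suffixId = 0; // 1-based name of the suffix ending here, 0 if none
    }

    private final Node root = new Node();

    // Insert every suffix character by character: O(|S|^2) time and space.
    SuffixTrie(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
            }
            node.suffixId = start + 1;
        }
    }

    // Walk one branch per pattern character, then report every suffix below.
    List<Integer> occurrences(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) {
            node = node.children.get(pattern.charAt(i));
            if (node == null) return new ArrayList<>(); // no branch matches: fail
        }
        List<Integer> found = new ArrayList<>();
        collect(node, found);
        Collections.sort(found);
        return found;
    }

    private static void collect(Node node, List<Integer> out) {
        if (node.suffixId != 0) out.add(node.suffixId);
        for (Node child : node.children.values()) collect(child, out);
    }
}
```

For the text ababc, occurrences("ab") reports suffixes 1 and 3, matching the slide's observation that ab is a prefix of s1 and s3.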

16 Suffix Tree
Nodes that mark a split are called essential
Remove non-essential nodes, and label edges with strings
[Figure: the suffix trie for ababc beside the resulting suffix tree, which has edges ab, b, c at the root and edges abc, c below both ab and b]

17 Properties
The tree has |S| leaves and at most |S|-1 interior nodes, so at most 2|S|-1 nodes and 2|S|-2 edges
The algorithm for search is the same: walk the tree matching edges
While this has fewer nodes, it is not clear that we need
    Any less storage? The sum of the lengths of the edge strings can still be O(N^2)
    Any less time to build the tree?
Storage is easier to address
[Figure: suffix tree for ababc]

18 Worst Case Storage
Here is a string whose suffixes need O(N^2) storage when stored as a trie: abcdefg
We can get a trie that needs O(N^2) storage with a limited alphabet: a^n b^n a^n b^n c
[Figure: trie for abcdefg, with branches abcdefg, bcdefg, cdefg, defg, efg numbered 1 to 5]

19 Efficient Storage
We store the whole string once, and keep pointers to that string in nodes
We have constant space per node and O(|S|) nodes, thus linear space
[Figure: suffix tree for ababc (a = 1, b = 2, a = 3, b = 4, c = 5) with each edge label replaced by an index pair: ab = (1, 2), b = (2, 2), c = (5, 5), abc = (3, 5); nodes are linked by child and sibling pointers]
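The index-pair representation can be sketched as follows. This is a hand-built copy of the ababc tree, not a construction algorithm; the class and helper names (IndexedEdges, node, collectSuffixes) are illustrative, and the root is given the empty range (1, 0) as a convenience.

```java
import java.util.ArrayList;
import java.util.List;

class IndexedEdges {
    // Constant space per node: two indices into the text and two pointers.
    static final class Node {
        final int start, end; // 1-based, inclusive, as on the slide
        Node child, sibling;
        Node(int start, int end) { this.start = start; this.end = end; }
        String label(String text) { return text.substring(start - 1, end); }
    }

    // Builds a node and chains its children through the sibling pointer.
    static Node node(int start, int end, Node... children) {
        Node n = new Node(start, end);
        for (int i = children.length - 1; i >= 0; i--) {
            children[i].sibling = n.child;
            n.child = children[i];
        }
        return n;
    }

    // Spells out every root-to-leaf path by following child/sibling pointers.
    static void collectSuffixes(Node n, String text, String prefix, List<String> out) {
        String path = prefix + n.label(text);
        if (n.child == null) out.add(path);
        for (Node c = n.child; c != null; c = c.sibling) collectSuffixes(c, text, path, out);
    }

    static List<String> exampleSuffixes() {
        String text = "ababc";
        Node root = node(1, 0,                      // empty label for the root
            node(1, 2, node(3, 5), node(5, 5)),     // ab -> abc, c
            node(2, 2, node(3, 5), node(5, 5)),     // b  -> abc, c
            node(5, 5));                            // c
        List<String> out = new ArrayList<>();
        collectSuffixes(root, text, "", out);
        return out;
    }
}
```

Every edge label is recovered from the single stored copy of the text, so the seven edges cost fourteen integers regardless of how long the labels are.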

20 Applications: Longest Repeat
As well as searching for a string, we can answer questions such as
What is the longest string that is duplicated?
What is the longest string that occurs k times?
Internal nodes mark repeating substrings
Keep track of the splits, and remember the deepest
In our example ababc, s1 and s3 share ab
[Figure: suffix tree for ababc]
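The "remember the deepest split" idea can be sketched on an uncompressed suffix trie, which is quadratic but short to write. The sketch assumes the character $ does not occur in the input; the names (LongestRepeat, deepestSplit) are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

class LongestRepeat {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    // Returns the longest substring that occurs at least twice ("" if none).
    static String longestRepeat(String s) {
        String text = s + "$"; // terminator: every repeat now ends at a split
        Node root = new Node();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
            }
        }
        return deepestSplit(root, "");
    }

    // A node with two or more children is a split; keep the deepest one found.
    private static String deepestSplit(Node node, String path) {
        String best = node.children.size() >= 2 ? path : "";
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            String candidate = deepestSplit(e.getValue(), path + e.getKey());
            if (candidate.length() > best.length()) best = candidate;
        }
        return best;
    }
}
```

On ababc the deepest split is at ab, as the slide notes. On a real suffix tree the same traversal visits only O(|S|) nodes.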

21 Longest Common Substring
Given two strings S and T, find the longest common substring
Build the suffix tree for the string S$T
Mark leaves of suffixes that begin in S red
Mark leaves of suffixes that begin in T black
Make a bottom-up traversal, looking for the deepest split that has leaves in both sets
[Figure: suffix tree for ababc]
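A simplified sketch of the red/black marking. It departs from the slide in two labeled ways: it inserts the suffixes of S and of T into one shared trie rather than building a suffix tree for S$T, and it colors every node during insertion rather than coloring leaves and propagating bottom-up. The names (LongestCommonSubstring, colors) are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

class LongestCommonSubstring {
    private static final int RED = 1;   // reached by some suffix of S
    private static final int BLACK = 2; // reached by some suffix of T

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int colors;
    }

    private static void insertSuffixes(Node root, String text, int color) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
                node.colors |= color;
            }
        }
    }

    static String longestCommon(String s, String t) {
        Node root = new Node();
        insertSuffixes(root, s, RED);
        insertSuffixes(root, t, BLACK);
        return deepestShared(root, "");
    }

    // Descend only through nodes that carry both colors; keep the longest path.
    private static String deepestShared(Node node, String path) {
        String best = path;
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            if (e.getValue().colors != (RED | BLACK)) continue;
            String candidate = deepestShared(e.getValue(), path + e.getKey());
            if (candidate.length() > best.length()) best = candidate;
        }
        return best;
    }
}
```

When several common substrings share the maximum length, which one is returned depends on map iteration order.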

22 Applications: Longest Palindrome
Given a string S, find the longest palindrome it contains
Build the suffix tree for the string S$S^-1, where S^-1 is S reversed
Mark leaves of suffixes that begin in S red
Mark leaves of suffixes that begin in S^-1 black
Look for the deepest split that has leaves in both sets
[Figure: suffix tree for ababc]

25 Linear Time Construction
There is a long history of work
Weiner 1973
McCreight 1976
Ukkonen 1992
[Figure: the suffixes of mississippi: mississippi, ississippi, ssissippi, sissippi, issippi, ssippi, sippi, ippi, ppi, pi, i]

26 McCreight
Add the suffixes from longest to shortest
We add a termination symbol, such as $, that does not appear in the text
This forces each addition to split the existing tree
We can split (add a node and two edges) in constant time
Can we find the place to do the splitting in constant time?
Suffix links give amortized linear time. But first, understand the algorithm.
[Figure: the tree for ababc after inserting s1 = ababc, then s2 = babc, then s3 = abc, which splits the first edge into ab followed by abc and c]
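Before suffix links, the insertion order and the constant-time split can be sketched with a naive builder. This is not McCreight's algorithm: it finds each split point by walking down from the root, so it is O(N^2) overall, which is exactly the cost the suffix links remove. The names (NaiveSuffixTree, insert, addLeaf) are illustrative, and $ is assumed not to occur in the input.

```java
import java.util.HashMap;
import java.util.Map;

class NaiveSuffixTree {
    static final class Node {
        int start, end;   // label of the edge into this node: text[start, end)
        int suffix = -1;  // 1-based suffix name, set on leaves only
        final Map<Character, Node> children = new HashMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
    }

    final String text;
    final Node root = new Node(0, 0);
    int nodeCount = 1;

    NaiveSuffixTree(String s) {
        text = s + "$";
        for (int i = 0; i < text.length(); i++) insert(i); // longest suffix first
    }

    private void insert(int suffixStart) {
        Node node = root;
        int pos = suffixStart;
        while (true) {
            Node child = node.children.get(text.charAt(pos));
            if (child == null) {          // no edge starts with this character
                addLeaf(node, pos, suffixStart);
                return;
            }
            int k = child.start;
            while (k < child.end && text.charAt(k) == text.charAt(pos)) { k++; pos++; }
            if (k == child.end) { node = child; continue; } // used the whole edge
            // Mismatch inside the edge: split it with one new node and two edges.
            Node mid = new Node(child.start, k);
            nodeCount++;
            node.children.put(text.charAt(child.start), mid);
            child.start = k;
            mid.children.put(text.charAt(k), child);
            addLeaf(mid, pos, suffixStart);
            return;
        }
    }

    private void addLeaf(Node parent, int labelStart, int suffixStart) {
        Node leaf = new Node(labelStart, text.length());
        leaf.suffix = suffixStart + 1;
        parent.children.put(text.charAt(labelStart), leaf);
        nodeCount++;
    }

    boolean contains(String p) {
        Node node = root;
        int i = 0;
        while (i < p.length()) {
            Node child = node.children.get(p.charAt(i));
            if (child == null) return false;
            for (int k = child.start; k < child.end && i < p.length(); k++, i++) {
                if (text.charAt(k) != p.charAt(i)) return false;
            }
            node = child;
        }
        return true;
    }
}
```

The terminator guarantees that no suffix is a prefix of another, so every insertion ends by hanging a new leaf, either at an existing node or at a fresh split.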

27 Ukkonen
Online algorithm: we don't need to know all of the string
Grow all suffixes together. In step k, add S[k] to the end of each suffix
At some point, suffix s_k will split from the tree (s2 breaks loose in step 2)
After that, s_k will never split again (though something may split from it)
A split for s_k may mean a similar split for s_(k+1)
There are 3 splits when adding c: s3 splits from s1, s4 from s2, and s5 from the root
[Figure: the tree for ababc after each step: a..., then ab... and b..., then aba.. and ba.., then abab. and bab., and finally the full tree with leaves 1 to 5]

28 Review
Introduce graphical notation for implicit nodes
aba means both suffixes "a" and "aba" are on the edge
ababc$ = s1
babc$ = s2
abc$ = s3
bc$ = s4
c$ = s5
$ = s6
[Figure: the step-by-step trees for ababc from the previous slide, redrawn with implicit nodes marked on the edges]

29 Mississippi
mississippi$ = s1
ississippi$ = s2
ssissippi$ = s3
sissippi$ = s4
issippi$ = s5
ssippi$ = s6
sippi$ = s7
ippi$ = s8
ppi$ = s9
pi$ = s10
i$ = s11
$ = s12
s4 is an implicit node
s4 is the active path
Def: the active path is the first non-leaf suffix remaining
[Figure: the tree after reading m, mi, mis and miss; leaves 1, 2, 3 hang on edges miss..., iss..., ss...]

30 Mississippi
s4 is an implicit node (the red s in the s3 edge)
s4 is the active path (the first non-leaf suffix remaining)
When we add S[5] = i, the active path s4 splits
s5 becomes the active path
[Figure: the tree before and after adding i; afterwards the edges are missi... (leaf 1), issi... (leaf 2), and s, which branches into si... (leaf 3) and i... (leaf 4)]

31 Mississippi
s5 is the active path (the first non-leaf suffix remaining)
At the end there are 3 non-leaf suffixes (s5, s6, s7)
[Figure: the tree after reading missi, missis and mississ; the leaf edges grow but no new splits occur]

32 Mississippi
Add i.
Add p. We have never seen p, so all 4 (now 5) trailing suffixes split
s10, at the root, becomes the active path
[Figure: the tree for mississi with leaves 1 to 4, and the tree for mississip with leaves 1 to 9]

33 Mississippi
Redraw the last diagram. About to add a second p.
s10 is the active path, and it is at the root
[Figure: the tree for mississip with leaves 1 to 9]

34 Mississippi
The active path is still s10
It is trailing s9
[Figure: the tree for mississipp with leaves 1 to 9]

35 Mississippi
Add i. This forces the split of s10 from s9.
The active path is now s11
[Figure: the tree for mississippi with leaves 1 to 10]

36 Algorithm
We are building a tree, adding character S[k] to every suffix
We traverse the boundary path, the growing edge of the tree
The boundary path includes
    Suffixes that have already become leaves
    Suffixes that currently end in implicit interior nodes
We add character S[k] to the end of each suffix
In general we have O(N) suffixes on the boundary path, we add each of N characters to each suffix on the boundary path, and we must navigate from suffix to suffix, which may be O(N) steps apart.
How can we do this in O(N) time?

37 Algorithm
We have O(N) suffixes on the boundary path,
We add each of N characters to each suffix on the boundary path,
We navigate from suffix to suffix, which may be O(N) steps apart.
How can we do this in O(N) time?
Ans: We cheat. Here are three big ideas (each is explained in detail later)
1) Once a path has split off, updating it is free, so we ignore it
2) Rather than "walk" the boundary path as we add a new character, we only need to watch one representative: the active path, the longest suffix that is not yet a leaf
3) When we do need to walk the boundary path, there is a cheap way to walk from suffix to suffix, by creating suffix links

38 Leaves are Cheap
1) Once a path has split off, "updating" it is free
We represent a leaf that splits at character S[k] as the string S[k..whatever]
If some later suffix is following our path, it is up to that suffix to find the point of difference
s5 is following s2, but s2 is a leaf and does not care
We don't even need to know the length of the string (whatever)
[Figure: the tree for mississi with leaves 1 to 4; s5 = issi is traveling along the leaf edge of s2 = ississi]
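One way to read S[k..whatever] is that a leaf stores only where its label starts and borrows its end from the text read so far. The following is a small sketch under that assumption; the names (OpenLeaf, GrowingText, Leaf) are illustrative.

```java
class OpenLeaf {
    // The text read so far; appending to it is the only per-step work.
    static final class GrowingText {
        final StringBuilder chars = new StringBuilder();
        void append(char c) { chars.append(c); }
    }

    // A leaf edge remembers only its start; its end is always "now".
    static final class Leaf {
        final int start; // 0-based index where this leaf's label begins
        Leaf(int start) { this.start = start; }
        String label(GrowingText text) { return text.chars.substring(start); }
    }
}
```

Appending a character to the text lengthens every leaf label at once, without touching any leaf, which is why leaves can be ignored during an update.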

39 Active Path
2) We can focus our attention on the longest suffix that has not yet broken free, called the active path. It represents the rest of the boundary path
Assume the active path is the suffix S_i and we have just added char S[k]
Assume that S_i is a prefix of suffix S_j up to this point
Then S_(i+1) is a prefix of suffix S_(j+1), and so on
Proof: S_(i+1) is just S_i without character S[i]
The converse is not true: S_i may leave the tree while S_(i+1) remains in the tree
In the example, this means that we only need to watch S5
[Figure: S_i = S[i..k] drawn along S_j = S[j..k], and S_(i+1) = S[i+1..k] along S_(j+1) = S[j+1..k], beside the tree for mississi with leaves 1 to 4]

40 Review example
Add p. We have never seen p, so s5, s6, s7, s8 and s9 all split.
s10, which is currently at the root, becomes the new active path
Before the split:
S5 is a prefix of S2
S6 is a prefix of S3
S7 is a prefix of S4
S8 is a prefix of S5
[Figure: the tree for mississi with leaves 1 to 4, and the tree for mississip with leaves 1 to 9]

41 Suffix Links
3) There is a cheap way to walk the boundary path
Once the active path splits, we need to walk the boundary path until the splitting stops
To explain the suffix link, return to our view as a trie for ababc
We have inserted s[1] through s[4], and are about to insert s[5] = c
s1 points to s2, which points to s3, which points to s4, which points to the root
I know I will have no problems with leaves s1 and s2: the active path is s3
When I find that s3 needs to split from s1, I need to check s4 as well, and perhaps s5
I follow the suffix pointers from s3
[Figure: the suffix trie for abab, with suffix links drawn from each suffix to the next shorter one]
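The trie view of suffix links can be sketched as an online suffix trie builder: each node links to the node for the same string minus its first character, and each new character is added by following those links along the boundary path. This sketch works on the uncompressed trie, so it is quadratic in space and is not the linear-time tree algorithm. The names (OnlineSuffixTrie, append, suffixChainLength) are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

class OnlineSuffixTrie {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node link; // suffix link: this node's string minus its first character
    }

    final Node root = new Node();   // root.link stays null: nothing is shorter
    private Node longest = root;    // node spelling the whole text read so far
    int nodeCount = 1;

    // Add one character to every suffix by walking the boundary path.
    void append(char c) {
        Node r = longest;
        Node previous = null;
        while (r != null && !r.next.containsKey(c)) {
            Node created = new Node();
            nodeCount++;
            r.next.put(c, created);
            if (previous != null) previous.link = created;
            previous = created;
            r = r.link;             // hop to the next shorter suffix
        }
        // The walk stopped at a suffix that already has c, or fell off the root.
        previous.link = (r == null) ? root : r.next.get(c);
        longest = longest.next.get(c);
    }

    boolean contains(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length() && node != null; i++) {
            node = node.next.get(pattern.charAt(i));
        }
        return node != null;
    }

    // Number of suffix-link hops from the longest suffix down to the root.
    int suffixChainLength() {
        int hops = 0;
        for (Node n = longest; n != root; n = n.link) hops++;
        return hops;
    }
}
```

Feeding in ababc one character at a time, the last step creates nodes for ababc, babc, abc, bc and c; the last three are the splits of s3, s4 and s5 that the slide describes. The walk stops early in the step before, when adding b, because the suffix a already has a b branch.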

42 Accounting
I add one character at a time to one suffix, the active path. This is clearly linear
When the active path splits, I need to start walking the boundary path from the old active path to the new end point (the point where the splitting stops)
Any individual character may cause lots of splitting, but each suffix only splits once. The amortized cost is linear
To walk the boundary path, I update the suffix links. This can also be amortized.
[Figure: the suffix trie for abab before and after adding c, with new c branches on s3, s4 and the root]

43 Building Suffix Links
When we split, we need to add new nodes
These nodes will need new suffix links
We are showing a chain of suffix links


51 Canonize
We represent a suffix as an explicit node and a (growing) string of characters
Start with (n1, (a))
Add characters bbac to get (n1, (abbac))
We canonize this in a sequence of steps to get a better representation: (n2, (bac)), then (n3, (c))
This allows us to use the suffix link at n3 rather than the suffix link at n1
[Figure: a chain of nodes n1, n2, n3, n4 joined by edges labeled ab, ba, ca]
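The canonize step can be sketched as repeatedly hopping over any edge that the pending string covers completely. The chain n1 -ab-> n2 -ba-> n3 -ca-> n4 below is an assumption read off the slide's figure labels, and the names (Canonize, Ref, connect) are illustrative. As in the tree algorithm, the pending string is assumed to already lie on a path in the tree, so only the first character of each edge is checked.

```java
import java.util.HashMap;
import java.util.Map;

class Canonize {
    static final class Node {
        final String name;
        final Map<Character, Edge> edges = new HashMap<>();
        Node(String name) { this.name = name; }
        void connect(String label, Node target) {
            edges.put(label.charAt(0), new Edge(label, target));
        }
    }

    static final class Edge {
        final String label;
        final Node target;
        Edge(String label, Node target) { this.label = label; this.target = target; }
    }

    // A suffix position: an explicit node plus the characters still pending.
    static final class Ref {
        final Node node;
        final String rest;
        Ref(Node node, String rest) { this.node = node; this.rest = rest; }
    }

    // Move to the deepest explicit node that the pending string reaches.
    static Ref canonize(Node node, String rest) {
        while (!rest.isEmpty()) {
            Edge edge = node.edges.get(rest.charAt(0));
            if (edge == null || edge.label.length() > rest.length()) break;
            node = edge.target;
            rest = rest.substring(edge.label.length());
        }
        return new Ref(node, rest);
    }

    static Ref example() {
        Node n1 = new Node("n1"), n2 = new Node("n2"), n3 = new Node("n3"), n4 = new Node("n4");
        n1.connect("ab", n2);
        n2.connect("ba", n3);
        n3.connect("ca", n4);
        return canonize(n1, "abbac"); // (n1, abbac) -> (n2, bac) -> (n3, c)
    }
}
```

The loop stops at n3 because the edge ca is longer than the single pending character c.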

52 Post mortem
The algorithm to build a Suffix Tree is linear in time and space. We haven't proved this, but perhaps it is now plausible
But is the algorithm practical?
There are real issues when dealing with long strings
The human genome has about 3 billion base pairs
Keeping the suffix links updated can cause thrashing as we walk all over the suffix tree representing this
The suffix tree is important enough that people are working on the issue
One idea that is easy to describe: merging suffix trees

53 References
A great reference to the field is Dan Gusfield's Algorithms on Strings, Trees, and Sequences.
P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory, 1-11.
E. M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262-272.
E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249-260.
R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331-353.
D. Gusfield [1997] (1999). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press.

