Algorithms in Bioinformatics: A Practical Introduction Suffix tree.

Algorithms in Bioinformatics: A Practical Introduction Suffix tree

Overview What is suffix tree? Simple application of suffix tree Linear time algorithm for constructing suffix tree Suffix array FM-index 1-mismatch search

Suffix Trie E.g. consider the string S = acacag$ Suffix Trie: a ties of all possible suffices of S $a c a g c a g c g 7 c g 1 3 5 2 4 6 a gg a $7 g$6 ag$5 cag$4 acag$3 cacag$2 acacag$1 Suffix $ $ $$ $ $

Suffix Tree Suffix tree for S=acacag$: merge nodes with only one child S= v $a c a g c a g c g c g 7 13 5 24 6 a g g $ $ $ $ $ $ a Path-label of node v is “ aca ” Denoted as  (v) “ ca ” is an edge label This is a leaf edge

Size of Suffix Tree (I) How big is a suffix tree? Suffix tree has exactly n leaves and at most 2n-1 edges The total length of all edge labels is O(n 2 ). Can we store suffix tree using o(n 2 ) bit space? S= $a c a g c a g c g c g 7 13 5 24 6 a g g $ $ $ $ $ $ a

Size of Suffix Tree (II) Suffix tree has exactly n leaves and at most 2n-1 edges Note that each edge label can be represented using 2 indices Thus, suffix tree can be represented using O(n log n) bits S= $a c a g c a g c g c g 7 13 5 24 6 a g g $ $ $ $ $ $ 6,7 4,7 a 7,7 1,1 2,3 Note: The end index of every leaf edge should be 7, the last index of S. Thus, for leaf edges, we only need to store the start index.

Property of suffix tree Fact: For any internal node v in the suffix tree, if the path label of v is  (v)=ap, then there exists another node w in the suffix tree such that  (w)=p. Proof: Skip the proof. Definition of Suffix Link: For any internal node v, define its suffix link sl(v) = w.

Suffix Link example S=acacag$ $a c a g c a g c g c g 7 13 5 24 6 a g g $ $ $ $ $ $ a

Generalized suffix tree Build a suffix tree for two or more strings E.g. S 1 = acgat#, S 2 = cgt$ a c g g t a t t a t c 4 1 2 1 32 3 t a t $ $ $ # g t # 5 # # 4 6 #$ #

Applications of Suffix Tree

Exact string matching problem To find all occurrences of Q in S (searching) Search for the node x in the suffix tree which represent Q All the leaves in the subtree rooted at x are the occurrences Time: O(|Q| + occ) where occ is the total no. of occurrences $a c a g c a g c g c g 7 1 3 5 24 6 a g g $ $ $ $ $ $ a E.g. S = acacag$ Q = aca Occurrences: 1, 3

Longest repeated substring problem To find the longest repeated substring in S Find the deepest internal node Time: O(n) $a c a g c a g c g c g 7 13 5 24 6 a g g $ $ $ $ $ $ a E.g. S = acacag$ The longest repeat is aca.

Longest common substring problem To find the longest common substring of two or more sequences Note: 1970, Don Knuth conjectured that a linear time algorithm for this problem is impossible Now, we know that it can be solved in linear time. E.g. consider two string S 1 and S 2, 1. Build generalized suffix tree for S 1 # and S 2 $ 2. Then, mark each internal node with leaves representing suffixes of both S 1 and S 2. 3. Report the deepest marked node

Example for the longest common substring E.g. S 1 = acgat#, S 2 = cgt$ The longest common substring is “ cg ”. Its length is 2. a c g g t a t t a t c 4 1 2 1 32 3 t a $ $ $ # g t # 5 # # 4 6 #$ # t

Longest common prefix (I) Given a string S. For any i, j, Denote lcp(i, j) be the length of the longest common prefix of suffix i and j of S. $a c a g c a g c g c g 7 13 5 2 4 6 a g g $ $ $ $ $ $ a S= The longest common prefix of suffix 1 and suffix 3 is aca! lcp(1, 3) = 3

Longest common prefix (II) Note that the lowest common ancestor(lca) of leaves i and j identifies the longest common prefix. lcp(i, j) = |  (lca(i, j))|. A well-know result: Consider a tree of size n, after an O(n) time preprocessing, the lca for any two nodes can be returned in O(1) time. First obtained by Harel and Tarjan (SIAM J. Comp. 1984) Simplified by Schieber and Vishkin (SIAM J. Comp. 1988) Based on the above result, After an O(n) time preprocessing, For any suffix i and suffix j, we can compute the longest common prefix of them in O(1) time.

Finding Palindrome (I) Given a string S, palindrome is a substring u of S s.t. u = u r E.g. ACAGACA Consider a palindrome u=S[i..i+|u|-1], u is called a maximal palindrome if S[i ’..j ’ ] is not a palindrome for any [i ’..j ’ ]  [i..i+|u|-1]. Note that every palindrome is contained in a maximal palindrome. Thus, maximal palindromes are a compact way to represent all palindromes. Complemented Palindrome is a string u s.t. u = ū r E.g. ACAUGU Maximal complemented palindrome is defined similarly.

Finding Palindrome (II) Recall that restriction enzyme usually is in the form of complemented palindrome. This motivates the following two problems: The palindrome problem: Given a string S (representing the genome) of length n, the problem is to locate all maximal palindromes in S. The complemented palindrome problem: Given a string S (representing the genome) of length n, the problem is to locate all maximal complemented palindromes in S.

Properties of palindrome (I) If S[i..i+k-1]=S r [n-i+1..n-i+k], then u=S[i-k+1..i+k-1] is an odd length palindrome

Properties of palindrome (II) If S[i..i+k-1]=S r [n-i+2..n-i+k+1], then u=S[i-k..i+k-1] is an even length palindrome

Solution to the palindrome problem Preprocess S and S r so that any longest common prefix query can be answered in constant time. For i=1 to n, Find the longest common prefix for (S i, S r n-i+1 ). If the longest prefix is k, we find an odd length maximal palindrome S[i-k+1..i+k-1]. Find the longest common prefix for (S i, S r n-i+2 ). If the longest prefix is k, we find an even length maximal palindrome S[i-k..i+k-1].

Extracting embedded suffix tree from a generalized suffix tree Input: The generalized suffix tree T of K strings S 1, …, S K. Aim: Compute the suffix tree T i of the string S i. r x y a c g g t a t t a t c 4 1 2 1 32 3 t a t $ $ $ # g z w t # 5 # # 4 6 #$ # r a c g g t a t a c 4 1 23 t a t # g w t # 5 # # 6 # # S 1 = acgat#, S 2 = cgt$ T T1T1

Extracting embedded suffix tree from a generalized suffix tree Observation: T i is a subtree of T such that The leaves of T i are the leaves of T corresponding to S i. The internal nodes of T i are the lowest common ancestors of some leaves for S i. The edges of T i can be inferred from the ancestor descendent relationship among those nodes. r x y a c g g t a t t a t c 4 1 2 1 32 3 t a t $ $ $ # g z w t # 5 # # 4 6 #$ # r a c g g t a t a c 4 1 23 t a t # g w t # 5 # # 6 # # S 1 = acgat#, S 2 = cgt$ T T1T1

Extracting embedded suffix tree from a generalized suffix tree r a c g g t a t a c 4 1 23 t a t # g w t # 5 # # 6 # # r a c g g a t a c 4 1 23 ta t # g w t # # 6 # # r a c g a t c 412 a t g w t # # 6 # # r a c 4 1 a t g w t # 6 # # r a c 1 a t g 6 # # r 6 #     

Common substrings of more than 2 strings (I) Given a set of strings (protein or DNA sequences), we want to know what substrings are common to a large number of these strings? Why this question is important? DNA and protein sequences will evolve. If a substring occur commonly in wide range of species. This may mean that the substring is critical for the correct functionality.

Common substrings of more than 2 strings (II) Given K strings whose total length is n. For every 2  k  K, define l(k) be the length of the longest substring common to at least k of these strings. The problem is to compute l(k) for all k.

Common substrings of more than 2 strings (III) Example: Consider a set of 5 strings { sandollar, sandlot, handler, grand, pantry } Then, we have kl(k)corresponding substring 24sand 33and 43 52an

Common substrings of more than 2 strings (IV) Illustrating the solution by example: S 1 = aacg$, S 2 = acgc#, S 3 = cga%. (K=3) 1. Build a generalized suffix tree T for the K strings in O(n) time. $ a c g c g g c 5 21 $ # 54 #% 3 % 1 g $ c a 4 # c 3 1 $ # 2 a % c 4 2 $ # 3 a %

Common substrings of more than 2 strings (V) 2. By traversing T, for each internal node v, compute its string depth. In total, O(n) time. 0 1 1 3 $ a c g c g g c 5 21 $ # 54 #% 3 % 1 g $ c a 4 # 2 c 3 1 $ # 2 a % 1 c 4 2 $ # 3 a %

Common substrings of more than 2 strings (VI) 3. By traversing T, for each internal node v, compute C(v). [C(v) is defined as the number of distinct termination symbols in the subtree rooted at v] This step takes O(Kn) time. 3 3 3 2 $ a c g c g g c 5 21 $ # 54 #% 3 % 1 g $ c a 4 # 3 c 3 1 $ # 2 a % 3 c 4 2 $ # 3 a %

Common substrings of more than 2 strings (VII) 4. Traverse T and visit every internal node v. For each v, if V(C(v)) < string-depth of v, set V(C(v)) = string-depth of v. [After step 4, V(k) = the length of the longest substring common to exactly k of these strings.] 5. l(k)=V(k). For i=k-1 downto 2, l(i)=max{l(i+1), V(i)}. This two steps take O(n) time. For our example, V(2) = 3, V(3) = 2. Thus, l(3) = 2, l(2) = 3. In total, this algorithm takes O(Kn) time. Actually, we can improve this algorithm to O(n) time by mean of lcp!

Linear time algorithm for constructing suffix tree

Straightforward construction of suffix tree Consider S = s 1 s 2 … s n where s n =$ Algorithm: Initialize the tree with only a root For i = n to 1 Includes S[i..n] into the tree Time: O(n 2 )

Example of construction S=acca$  InitFor-loop $ 5 I5I5 a 4 I4I4 5 $ $  a 4 I3I3 5 $ $ 3 a $ c I2I2 c c a a 3 2 a 4 5 $ $ $ $  I1I1 c c a a 3 2 5 $ $ $ c c a a 4 1 $ $

Construction of generalized suffix tree S ’ = c#  I1I1 c c a a 3 2 5 $ $ $ c c a a 4 1 $ $ InitFor-loop J2J2 c c a a 3 2 5 $ $ $ c c a a 4 1 $ $ 2 #  J1J1 c c a a 3 2 5 $ $ $ c c a a 4 1 $ $ 2 # # 1

Can we construct a suffix tree in o(n 2 ) time? Yes. We can construct it in O(n) time. Weiner ’ s algorithm [1973] Linear time for constant size alphabet, but much space McGreight ’ s algorithm [JACM 1976] Linear time for constant size alphabet, quadratic space Ukkonen ’ s algorithm [Algorithmica, 1995] Online algorithm, linear time for constant size alphabet, less space Farach ’ s algorithm [FOCS 1997] Linear time for general alphabet Hon,Sadakane, and Sung ’ s algorithm [FOCS 2003] O(n) bit space O(n log e n) time for 0<e<1 O(n) bit space O(n) time for suffix array construction We will discuss Farach ’ s algorithm later.

Idea Build Odd Suffix Tree and Even Suffix Tree Then, merge odd and even suffix tree. $a c a g c a g c g c g 7 13 5 2 4 6 a g g $ $ $ $ $ $ a $a c a g c a g c g c g 7 13 5 2 4 6 a g g $ $ $ $ $ $ a Even Suffix TreeOdd Suffix Tree

Idea Input: a string S of length n 1. Recursively compute the suffix tree T o of all suffixes beginning at the odd positions. T o is of size n/2. 2. From T o, compute T e which is the suffix tree for all suffixes beginning at the even positions. 3. Merge T o and T e to form the suffix tree for S.

Stage 1: Constructing odd suffix tree Given a string S[1..n], we generate a new string S ’ [1..n/2] as follows. we map pairs of characters into single characters as follows: S[1..2], S[3..4], S[5..6], …, S[n-1..n]. Remove the duplicates from the pairs of characters and sort them by radix sort. S ’ [i] = rank of S[2i-1..2i] in the sorted list, for i=1, 2, …, n/2. By recursion, we get the suffix tree T ’ for S ’ Convert T ’ to the odd suffix tree T o.

Example (I) S = aaabbbabbaba$ S[1..2]=aa, S[3..4]=ab, S[5..6]=bb, S[7..8]=ab, S[9..10]=ba, S[11..12]=ba. By stable sort, aa < ab < ba < bb. Rank(aa)=1, Rank(ab)=2, Rank(ba)=3, Rank(bb)=4. So, S ’ =124233$.

Example (II) By recursion, construct the suffix tree T ’ for S ’ : 7 $ 1 124233$124233$ 4 33$33$ 2 2 24233$24233$ 3 65 3$3$ $ 3 4233$

Example (III) Convert T ’ to the odd tree: 13 $ 1 aaabbbabbaba$aaabbbabbaba$ 7 baba$baba$ abab 3 ba 119 ba$ba$ $ 5 bbabbaba$ bbbabbaba$bbbabbaba$ This is not a suffix tree i  2i-1

Example (IV) Refine the odd tree T o : 13 $ 1 aabbbabbaba$aabbbabbaba$ 7 aba$aba$ bbbb 3 119 ba$ba$ $ 5 babbaba$ bbabbaba$bbabbaba$ b a a

Time complexity for building the odd tree Let Time(n) be the time to build a suffix tree for a string of length n. Stable sorting and refinement of the odd trees take O(n) time. Build suffix tree for S ’ takes Time(n/2). So, Stage 1 takes Time(n/2)+O(n) time.

Stage 2: Build the even tree 1. Generate the lex-ordering of the leaves in T e. 2. For any two adjacent leaves 2i and 2j, we find lcp(2i, 2j). 3. Construct the even tree T e from left to right (according to the lex-ordering).

Build the even tree (Step 1) We get the lex-ordering of the leaves in T o. Generate the lex-ordering of the leaves in T e. For each leaf i in T o, get the preceding character c=S[i-1] and form a pair (c,i). Each pair represents a even suffix i-1. Perform stable sorting on those pairs. We get the lex-ordering of the leaves in T e.

Example Lex-ordering of the leaves in T o : 13 < 1 < 7 < 3 < 11 < 9 < 5 The pairs are: (a,13), ($,1), (b,7), (a, 3), (a, 11), (b, 9), (b, 5). After stable sorting, we have ($, 1), (a, 13), (a, 3), (a, 11), (b, 7), (b, 9), (b, 5). Hence, the lex-ordering of the leaves of T e : 12 < 2 < 10 < 6 < 8 < 4 S = aaabbbabbaba$

Build the even tree (Step 2) For any two adjacent leaves 2i and 2j, we first find lcp(2i, 2j). Observation: lcp(2i, 2j) = lcp(2i+1, 2j+1)+1 if S[2i]=S[2j] 0 otherwise Proof: If S[2i]  S[2j], lcp(2i,2j)=0. Otherwise, lcp(2i,2j)=1+lcp(2i+1,2j+1).

Example Recall that the lex- ordering of leaves: 12 < 2 < 10 < 6 < 8 < 4. By the previous observation, we have lcp(8,4)=lcp(9,5)+1=2 Similarly, we have lcp(12,2)=1, lcp(2,10)=1, lcp(10,6)=0, lcp(6,8)=1, lcp(8,4)=2 13 $ 1 aabbbabbaba$aabbbabbaba$ 7 aba$aba$ bbbb 3 119 ba$ba$ $ 5 babbaba$ bbabbaba$bbabbaba$ b a a

Build the even tree (Step 3) Construct the even tree T e from left to right. 12 a a$a$ $ 2 abbbabbaba$abbbabbaba$ a $ 2 abbbabbaba$abbbabbaba$ 10 ba$ba$

Build the even tree (Step 3) a 12 $ 2 abbbabbaba$abbbabbaba$ 10 ba$ba$ 6 babbaba$babbaba$ a 12 $ 2 abbbabbaba$abbbabbaba$ 10 ba$ba$ 6 abbaba$abbaba$ b 8 baba$baba$ a 12 $ 2 abbbabbaba$abbbabbaba$ 10 ba$ba$ 6 abbaba$abbaba$ b 8 aba$aba$ 4 babbaba$babbaba$ b

Time complexity for building the even tree Step 1: O(n) time Step 2: O(n) time Step 3: O(n) time

Stage 3: Merge odd and even trees We can merge T o and T e by DFS. However, it takes O(n 2 ) time. Even tree Odd tree 13 $,1 1 73 119 $,1 b,1 a,1 12 $,1 a,1 2 b,1 10 a,2 a,4 b,9 6 b,1 b,5 8 4 b,1 a,2 5 b,5 b,8 a,11 b,10 a,2 b,1 13 $,1 1 73 119 $,1 b,1 a,1 12 $,1 a,1 2 b,1 10 a,2 a,4 b,9 6 b,1 b,5 8 4 b,1 a,2 5 b,5 b,8 a,2 b,1

Stage 3: Merge odd and even trees We merge T o and T e by DFS. We merge two edges as long as they start with the same character. The merge is ended when one edge is longer than the other. 13 $ 1 aabbbabbaba$aabbbabbaba$ 7 aba$aba$ bbbb 3 119 ba$ba$ $ 5 babbaba$ bbabbaba$bbabbaba$ b a a a 12 $ 2 abbbabbaba$abbbabbaba$ 10 ba$ba$ 6 abbaba$abbaba$ b 8 aba$aba$ 4 babbaba$babbaba$ b

Merge odd and even trees We merge T o and T e by DFS. We merge two edges as long as they start with the same character. The merge is ended when one edge is longer than the other. 13 $,1 2 7 3 119 $,1 b,1 a,1 12 $,1 a,11 1 $,1 b,2 10 $,1 a,4 b,9 6 b,3 8 4 b,1 a,4 5 b,3 b,8

Merge odd and even trees The merging may over-merged some nodes. To correct the tree, we need to unmerge some nodes. 13 $,1 2 7 3 119 $,1 b,1 a,1 12 $,1 a,11 1 $,1 b,2 10 $,1 a,4 b,9 6 b,3 8 4 b,1 a,4 5 b,3 b,8

Definition of L() and d() For every node u which may be over-merged, there exist two leaves 2i and 2j-1 such that u=lca(2i, 2j-1). Denote L(u) be the correct depth of u, that is, lcp(2i,2j-1). Note that lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0 otherwise. Let v be lca(2i+1,2j). Denote d(u) = v. Note that d() is equivalent to suffix link!

Example of d() 13 $,1 2 7 3 119 $,1 b,1 a,1 12 $,1 a,11 1 $,1 b,2 10 $,1 a,4 b,9 6 b,3 8 4 b,1 a,4 5 b,3 b,8 d()

Relationship between L() and d() Suppose u = lca(2i, 2j-1). Note1: if u is not the root, then S[2i]=S[2j-1]. Note2: lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0 otherwise Note3: d(u) = lcp(2i+1,2j) Hence, L(u) = 1 + L(d(u)) if u is not the root. Otherwise, L(u)=0. Lemma: L(u) = the length of the purple path from u to the root.

Example of L() 13 $,1 2 7 3 119 $,1 b,1 a,1 12 $,1 a,11 1 $,1 b,2 10 $,1 a,4 b,9 6 b,3 8 4 b,1 a,4 5 b,3 b,8 L( )=2 L( )=3 L( )=4 L( )=2 L( )=1

Unmerge the border nodes based on L() (I) 13 $,1 2 7 3 119 $,1 b,1 a,1 12 $,1 a,11 1 $,1 b,2 10 $,1 a,4 b,9 6 b,3 8 4 b,1 a,4 5 b,3 b,8 L( )=2 L( )=3 L( )=4 L( )=2 L( )=1

Unmerge the border nodes based on L() (II) 13 $,1 1 73 119 $,1 b,1 a,1 12 $,1 a,1 2 b,1 10 a,2 a,4 b,9 6 b,1 b,5 8 4 b,1 a,2 5 b,5 b,8 a,11 b,10 a,2 b,1

Time complexity for merging Merge the tree using DFS takes O(n) time. Compute the links d() takes O(n) time. Compute L() takes O(n) time. Unmerge takes O(n) time.

Total time complexity of Farach ’ s algorithm Stage 1: Time(n/2)+O(n) Stage 2: O(n) Stage 3: O(n) Thus, Time(n) = Time(n/2)+O(n). By solving the equation, Time(n)= O(n).

Disadvantage of suffix tree Suffix tree is space inefficient. It requires O(n|  |log n) bits. Manber and Myers (SIAM J. Comp 1993) proposes a new data structure, called suffix array, which has a similar functionality as suffix tree. Moreover, it only requires O(n log n) bits.

Suffix Array (I) It is just sorted suffixes. E.g. consider S = acacag$ Suffix array is an array of n indices. Thus, it takes O(n log n) bits. SuffixPositionSA[i]Suffix acacag$17$ cacag$21acacag$ acag$3=>3acag$ cag$4 Sort 5 ag$ 52cacag$ g$64cag$ $76g$

Observation The leaves of a suffix tree is in suffix array order. $a c a g c a g c g c g 7 13 5 24 6 a g g $ $ $ $ $ $ a SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$

Linear time construction of suffix array from suffix tree Recall that the suffix tree T of S[1..n] can be constructed in O(n) time. Then, by “ lexical ” depth-first traversal of T, the suffix array of S is obtained. This takes O(n) time. However, the space used during construction is the same as that for suffix tree! This defeats the purpose of suffix array. Today, we can build suffix array using O(n) bit space and O(n) time.

range(T,Q) For a pattern Q, its occurrences in T form a consecutive SA range. Example: For T= acacag$, ca occurs in SA[5] and SA[6]. Definition: We called range(T,Q)=[st..ed] if Q is a prefix of every T j for j=SA[st], SA[st+1], …, SA[ed] where T j = j suffix of T = T[j..n]. Example: range(T,ca)=[5..6] SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ 1 2 3 4 5 6 7

Find occurrence of query Q in a string S using suffix array Input: (1) the suffix array of a string T of length n and (2) a query Q of length m Aim: check if Q occurs in T Idea: binary search!

Algorithm

Example Consider T = acacag$ Pattern Q = acag L=1 R=7 M=(L+R)/2=4 SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Example Consider T = acacag$ Pattern Q = acag L=1 R=7 M=(L+R)/2=4 suffix-SA[M] > Q. Set R=M=4. SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Example Consider T = acacag$ Pattern Q = acag L=1 R=4 M=(L+R)/2=2 suffix-SA[M] < Q. Set L=M=2. SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Example Consider T = acacag$ Pattern Q = acag L=2 R=4 M=(L+R)/2=3 The pattern Q is found at SA[M]=3. SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Can we do better? During each step of binary search, we need to compare Q with a suffix using O(m) time, which is time consuming. Can we do better? We have the following observation. Suppose LCP(Q, suffix-SA[L]) is l and LCP(Q, suffix-SA[R]) is r. Then, LCP(Q, suffix-SA[M]) > min{l,r}. Below, we describe how to utilize this observation to speedup the computation.

Algorithm

Example Consider T = acacag$ Pattern Q = acag L=1, l=0 R=7, r=0 mlr = min(l,r)=0 M=(L+R)/2=4 SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Example Consider T = acacag$ Pattern Q = acag L=1, l=0 R=7, r=0 mlr = min(l,r)=0 M=(L+R)/2=4, m=1 The (m+1) char of suffix-SA[M] is g. The (m+1) char of Q is c. So, suffix-SA[M] > Q. Set R=M=4 and r=m=1. SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Example Consider T = acacag$ Pattern Q = acag L=1, l=0 R=4, r=1 mlr = min(l,r)=0 M=(L+R)/2=2, m=3 The (m+1) char of suffix-SA[M] is c. The (m+1) char of Q is g. So, suffix-SA[M] < Q. Set L=M=2 and l=m=3. SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Example Consider T = acacag$ Pattern Q = acag L=2, l=3 R=4, r=1 mlr = min(l,r)=1 M=(L+R)/2=3, m=4 The pattern Q is found at SA[M]=3. SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Time analysis Binary search will perform log n comparisons Each comparison takes at most O(m) time In the worst case, O(m log n) time. Myers and Manber report that, in practice, the time is O(m + log n).

Suffix array and suffix tree We show one example of replacing suffix tree by suffix array Note that most applications related to suffix tree can be solved using suffix array with some time blow up! When space is limited, replacing suffix tree by suffix array is a good choice.

The size is still too big! Why? DNA sequences can be very long! E.g. Fly: ~100M bases, Human: ~3G bases, Tree: ~9G bases Storage to store indexing data structure for human genome Suffix Tree: ~40G bytes Suffix Array: ~13G bytes Can we further reduce the space?

Solution Grossi, Vitter (STOC2000) Compressed suffix array (CSA) Ferragine, Manzini (FOCS2000) FM-index Both of them can be stored in O(n) bit space For Human Genome Both CSA and FM-index can be stored within 2G bytes.

FM-index Consider a text T=acacag$ FM-index stores: A. The string BW=g$ccaaa B. C[x] = total no. of occurrences of each symbol less than x. E.g. C[a]=1, C[c]=4, C[g]=6, C[t]=7 C. A data-structure occ(x, i) which tells us the number of occurrences of x in BW[1..i] using O(1) time.

Data-structure for answering the occ(x, i) query? Given the text BW[1..n], we divide BW[1..n] into buckets of size log 2 n. For each bucket i = 1, …, n/log 2 n, we store P[i] = number of x ’ s in BW[1.. ilog 2 n]. Each bucket is further subdivided into log n sub-buckets of size log n. For each sub-bucket j of the bucket i, we store Q[i][j] = number of x ’ s in the first j sub-buckets. ……… BW log 2 n ……… log n 2 n/log 2 n 341 123 log n 00x0xxx0x0

Data-structure for answering the occ(x, i) query? We also need a lookup table rank(b,k) b is any string of length (log n)/2 1  k  (log n)/2 rank(b,k) = number of x in the first k characters of b. 00 … 0 xx … x 1 2 3 ……………… (log n)/2 x0x..0 … … 1 1 2 ………………………

Space complexity of the occ() data-structure P[1..n/log 2 n] uses O(n/log n) bits Q[1..n/log 2 n][1..log n] uses O(n log log n / log n) bits rank(b,k) uses 2 logn/2 (log n/2) = o(n) bits In total, we use O(n log log n / log n) bits.

How to compute occ(x,i)? Suppose log n = 10. To compute occ(x, 327), The result is P[3]+Q[4][2]+rank(00x0x,5)+rank(xx0x0,2) Hence, O(1) time to compute occ(x,i). ……… BW log 2 n ……… log n 2 n/log 2 n 341 123 log n 00x0xxx0x0

Size of FM-index Structure A can be store in 2n bits Structure B can be store in O(log n) bits Structure C can be store in O(n log log n/log n) bits In total, the size of FM-index is O(n) bits.

Observation C[x]+occ(x,i) is the number of suffixes smaller than of xT SA[i]. Example: C[c]=4 occ(c, 6)=2 Number of suffixes smaller than cT[SA[6]..n]=ccag$ is 6.

Lemma Suppose range(T,Q) is [st..ed]. Then, range(T,xQ) = [p..q] where p = C[x] + occ(x,st-1) + 1 q = C[x] + occ(x,ed) Proof: p = 1+number of suffixes strictly smaller than xQ. The latter term = number of suffixes smaller than or equal to xT SA[st-1]. q = number of suffixes smaller than or equal to xT SA[ed].

Backward search Given the text T and the FM-index, we want to determine if Q exists in T. Algorithm BW_exist(Q[1..m]) 1. x=Q[m], i=m-1; 2. /* find range(T,Q[m]) */ st = C[x]+1, ed = C[x+1]; 3. while (st  ed and i>1) { /* find range(T, Q[i-1..m]) */ x = Q[i-1]; st = C[x] + occ(x, st-1) + 1; ed = C[x] + occ(x, ed); i = i – 1; } 4. if st > ed, then pattern not found else pattern found.

Example T = acacag$ Q = aca Q[3..3]=a sp=C[a] +1 =1+1=2 ep=C[c] =4 g$ccaaa

Example T = acacag$ Q = aca Q[2..3]=ca st=C[c]+occ(c,st old -1)+1 =4+0+1=5 ed=C[c]+occ(c,ed old ) =4+2=6 g$ccaaa

Example T = acacag$ Q = aca Q[1..3]=aca st=C[a]+occ(a,st old -1)+1 =1+0+1=2 ed=C[a]+occ(a,ed old ) =1+2=3 g$ccaaa

Example T = acacag$ Q = aca Q[1..3]=aca st=C[a]+occ(a,st old -1)+1 =1+0+1=2 ed=C[a]+occ(a,ed old ) =1+2=3 g$ccaaa Q occurrences in T!

Time complexity of Backward Search To find a pattern Q[1..m] Step 1, 2, and 4 can be computed in O(1) time. For step 3, We need to iterate the loop for m-1 times. Each iteration of the loop can be computed in O(1) time. The loop takes O(m) time. In total, O(m) time for backward search.

Conclusion Suffix tree is a powerful data-structure which has a lot of applications in Computational Biology. Problems: Suffix tree is too big! Can be solved using CSA and FM-index Suffix tree can only solve exact match problem! (Most of the biology problems are approximate match!) Many works have been done on this area! But still not pratical. One of the important area to explore!

1-mismatch problem

Index: the suffix tree of a text T[1..n] For any pattern P[1..m], the 1-mismatch problem finds all occurrences of P in T that has hamming distance at most 1. Example: P = ACGT T = AACGTGGCCAACTTGGA

Na ï ve solution Index: create the suffix tree for T Algorithm for query: Generate all possible 1-mismatch patterns of P. Find occurrences of every 1-mismatch pattern Running time: There are |  |m possible 1-mismatch patterns. Using suffix tree, it takes O(m) time to find occurrences of each 1-mismatch pattern In total, O(m 2 +occ) time where occ is total number of occurrences.

Any other solutions? Cobbs [CPM 1995] O(n log n) bit index, O(m 2 +occ) query time. Amir et al [Journal of Algorithm 2000] O(n log 3 n) bit index, O(m log n loglog n + occ) query time. Buchsbaum et al [ESA 2000] O(n log 2 n) bit index, O(m log log n + occ) query time. Cole et al [SODA 2004] O(n log 2 n) bit index, O(m + log log n + occ) query time. Trinh et al [CPM 2004] O(n log n) bit index, O(m log n + occ) query time. Lam et al [ISAAC 2005] O(n sqrt(log n)) bit index, O(m log log n + occ) query time. Chan et al [ESA 2006] O(n log n) bit index, O(m + log log n + occ) query time. Today, we have a look of the solution of Trinh et al.

Index Suffix array of T SA[1..n] Inversed suffix array of T SA -1 [1..n] where SA[SA -1 [i]] = i. Definition: We called range(T,P)=[st..ed] if P is a prefix of every T j for j=SA[st], SA[st+1], …, SA[ed] where T j = j suffix of T = T[j..n]. Example: For T= acacag$, range(T,ca)=[5..6] SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Lemma 1 (Forward search) Assume [st..ed]=range(T,P). We can compute [st ’..ed ’ ]=range(T,Pc) in O(log n) time. Proof: binary search on SA.

Lemma 2 Assume [st 1..ed 1 ]=range(T,P 1 ) and [st 2..ed 2 ]=range(T,P 2 ). We can compute [st..ed]=range(T,P 1 P 2 ) in O(log n) time. Proof: Let the length of P 1 be k. Note that T SA[st1],T SA[st1+1], …,T SA[ed1] are lexicographically increasing. Hence, T SA[st1]+k,T SA[st1+1]+k, …,T SA[ed1]+k are lexicographically increasing. Thus, SA -1 [SA[st 1 ]+k] < SA -1 [SA[st 1 +1]+k] < … < SA -1 [SA[ed 1 ]+k]. To find st and ed, we need to find the smallest st such that st 2 < SA -1 [SA[st]+k]<ed 2 and the largest ed such that st 2 < SA -1 [SA[ed]+k] < ed 2. This can be done by binary search.

Example T=acacag$ P 1 =a, P 2 =ca. range(T,P 1 )=[2..4], range(T,P 2 )=[5..6] To find range(T,P 1 P 2 ), we do the following: Note that T SA[2] <T SA[3] <T SA[4] are beginning with a. So, T SA[2]+1 <T SA[3]+1 <T SA[4]+1. Note that SA -1 [SA[2]+1]=5, SA -1 [SA[3]+1]=6, SA -1 [SA[4]+1]=7. Hence, T SA[5] <T SA[6] <T SA[7]. Among the three suffixes, we need to identify suffix beginning with P 2. Since range(T,P 2 )=[5..6], both T SA[5] and T SA[6] contain P 2. As SA -1 [SA[2]+1]=5 and SA -1 [SA[3]+1]=6, we have range(T,P 1 P 2 ) = [2..3]. SA[i]Suffix 7$ 1acacag$ 3acag$ 5 ag$ 2cacag$ 4cag$ 6g$ i 1 2 3 4 5 6 7

Lemma 3 (Backward search) Assume [st..ed]=range(T,P). We can compute [st ’..ed ’ ]=range(T,cP) in O(log n) time. Proof: Let P 1 =c, P 2 =P. By Lemma 2, range(T,P 1 P 2 ) can be computed in O(log n) time.

Algorithm 1. For j=m,m-1, …,1, By backward search, find range(T,P[j..m]). 2. For j=1,2, …,m, By forward search, find range(T,P[1..j]). 3. Report all occurrences of range(T,P[1..m]) 4. For j=1,2, …,m, Let P 1 =P[1..j-1], P 2 =P[j+1..m] For every character c  P[j], By forward search, find range(T,P 1 c) By Lemma 2, find range(T,P 1 cP 2 ) Report all occurrences of range(T,P 1 cP 2 )

Time analysis Ignoring the O(occ) reporting time! Steps 1 and 2 take O(m log n) time. Step 4 tries all possible mismatches. There are in total |  |m mismatches. For each mismatch, it take O(log n) time. So, Step 4 takes O(|  |m log n) time. In total, the running time is O(|  |m log n + occ). Note that this solution can be generalized to handle k-mismatch or k-difference.

Algorithms in Bioinformatics: A Practical Introduction Suffix tree.

Similar presentations

Presentation on theme: "Algorithms in Bioinformatics: A Practical Introduction Suffix tree."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Algorithms in Bioinformatics: A Practical Introduction Suffix tree.

Similar presentations

Presentation on theme: "Algorithms in Bioinformatics: A Practical Introduction Suffix tree."— Presentation transcript:

Similar presentations

About project

Feedback