6/29/2015 Efficient Algorithms for Motif Search. Sudha Balla, Sanguthevar Rajasekaran, University of Connecticut.


1 Efficient Algorithms for Motif Search. Sudha Balla, Sanguthevar Rajasekaran, University of Connecticut.

2 Problem 1 Definition. Input: n sequences of length m each, and integers l and d such that l << m and d < l. Each input sequence contains a variant of an unknown motif M of length l, at a Hamming distance of d from M. Output: M. This problem is known as the Planted (l, d) Motif Problem.

3 Problem 2 Definition. Input: a database DB of n sequences, and integers l, d, and q. Output: all the patterns in DB such that each pattern is of length l and occurs in at least q of the n sequences. A pattern u is considered an occurrence of another pattern v as long as the edit distance between u and v is at most d.

4 Problem 1: State of the Art. Two kinds of algorithms are known: approximate and exact. WINNOWER (Pevzner and Sze [2000]) and PROJECTION (Buhler and Tompa [2001]) are approximate algorithms. MITRA (Eskin and Pevzner [2002]) is an exact algorithm.

5 A Probabilistic Analysis. Problem 1 is complicated by the fact that, for a given value of l, the higher the value of d, the higher the expected number of motifs that occur by random chance. For instance, when n=20, m=600, l=9, d=2, the expected number of spurious motifs is 1.6. On the other hand, for n=20, m=600, l=10, d=2, the expected number of spurious motifs is only 6.1 × 10^-8.
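The two figures quoted on this slide can be reproduced with a common back-of-the-envelope estimate. The sketch below is an assumption, not something stated on the slide: it treats the input as n independent, uniformly random DNA sequences (alphabet size 4) and treats the l-mer windows within a sequence as independent. The function names are illustrative only.

```python
from math import comb


def neighbor_probability(l, d, sigma=4):
    """Chance that a uniformly random l-mer lies within Hamming distance d
    of one fixed l-mer, over an alphabet of size sigma (assumed uniform)."""
    favourable = sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1))
    return favourable / sigma ** l


def expected_spurious_motifs(n, m, l, d, sigma=4):
    """Expected number of l-mers that occur, within distance d, in each of
    n random sequences of length m (independence approximation)."""
    p = neighbor_probability(l, d, sigma)
    # Probability that at least one of the m - l + 1 windows of a sequence matches.
    hit_one_sequence = 1.0 - (1.0 - p) ** (m - l + 1)
    return sigma ** l * hit_one_sequence ** n


print(expected_spurious_motifs(20, 600, 9, 2))
print(expected_spurious_motifs(20, 600, 10, 2))
```

Under these assumptions the first call evaluates to roughly 1.6 and the second to roughly 6.1e-8, which agrees with the values on the slide.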

6 WINNOWER. Generate all the l-mers from all the input sequences. The number of such l-mers is O(nm). Construct a graph G(V,E) in which each l-mer is a node. Two nodes are connected if the Hamming distance between them is at most 2d (two l-mers that are each within d of M are within 2d of each other). Find all cliques in the graph. Process these cliques to identify M.

7 WINNOWER Details. Pevzner and Sze observe that the graph G constructed above is 'almost random' and is multipartite. They use the notion of an extendable clique. If Q is any clique, a node u is called a neighbor of Q if the nodes in Q together with u also form a clique. A clique is called extendable if it has at least one neighbor in every part of the multipartite graph G. The algorithm WINNOWER is based on the observation that every edge in a maximal n-clique belongs to at least $\binom{n-2}{k-2}$ extendable cliques of size k. This observation is used to eliminate edges.
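The graph construction that WINNOWER starts from can be sketched as follows. This is only an illustration of the construction step, not the clique search or the edge-elimination procedure. It assumes that edges are drawn only between l-mers from different sequences, which is one way to obtain the multipartite structure mentioned above; the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_winnower_graph(seqs, l, d):
    """Return (nodes, edges) for the l-mer graph.

    Each node is (sequence index, offset, l-mer). Assumption: two nodes are
    joined only if they come from different sequences and their Hamming
    distance is at most 2d.
    """
    nodes = [
        (j, i, s[i:i + l])
        for j, s in enumerate(seqs)
        for i in range(len(s) - l + 1)
    ]
    edges = set()
    for a, b in combinations(range(len(nodes)), 2):
        (ja, _, ua), (jb, _, ub) = nodes[a], nodes[b]
        if ja != jb and hamming(ua, ub) <= 2 * d:
            edges.add((a, b))
    return nodes, edges
```

The pairwise comparison is quadratic in the O(nm) nodes, so this sketch is only practical for small inputs.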

8 PROJECTION. Let C be the collection of all l-mers in the input. Project these l-mers along k randomly chosen columns (k is typically 7). Group the resulting k-mers so that equal k-mers are in the same group. If a group is of size greater than a threshold s (s is typically 3), then M is likely to have this k-mer. The rest of M is computed using maximum likelihood estimates.
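The grouping step of PROJECTION can be sketched as a hash of each l-mer on k randomly chosen columns. The sketch covers a single random projection only; the maximum likelihood refinement is not shown. The function name, the `rng` parameter, and the returned format are assumptions made for illustration.

```python
import random
from collections import defaultdict


def projection_buckets(seqs, l, k, s, rng=None):
    """Group all l-mers by their letters at k random columns.

    Returns the chosen columns and the buckets holding more than s l-mers.
    Each bucket lists (sequence index, offset) pairs.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed, chosen only to make the sketch repeatable
    columns = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for j, seq in enumerate(seqs):
        for i in range(len(seq) - l + 1):
            lmer = seq[i:i + l]
            key = "".join(lmer[c] for c in columns)
            buckets[key].append((j, i))
    large = {key: hits for key, hits in buckets.items() if len(hits) > s}
    return columns, large
```

In practice one would presumably repeat this with several independent choices of columns, since a single projection may happen to include a mutated position.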

9 MITRA. MITRA is based on WINNOWER and uses pairwise similarity information. It uses a mismatch tree data structure and splits the space of all possible patterns into disjoint subspaces, each consisting of the patterns that start with a given prefix. Pruning is applied in each subspace.

10 Pattern Branching. One way of solving the planted motif search problem is to start from each l-mer in the input, search the neighbors of this l-mer, score them appropriately, and output the best-scoring neighbor. Pattern Branching examines only a selected subset of the neighbors of any l-mer u of the input and hence is more efficient. For any l-mer u, let $D_i(u)$ stand for the set of neighbors of u that are at a Hamming distance of i. For any input sequence $S_j$, let $d(u,S_j)$ denote the minimum Hamming distance between u and any l-mer of $S_j$. Let $d(u,S)=\sum_{j=1}^{n} d(u,S_j)$.

11 Pattern Branching Contd… For any l-mer u in the input, let BestNeighbor(u) stand for the neighbor v in $D_1(u)$ whose distance d(v,S) is minimum among all the elements of $D_1(u)$. The PatternBranching algorithm starts from a u and identifies $u_1$ = BestNeighbor(u), then $u_2$ = BestNeighbor($u_1$), and so on. It finally outputs $u_d$. The best $u_d$ from among all possible u's is output.
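The walk described on the two Pattern Branching slides can be sketched directly from the definitions of d(u,S) and BestNeighbor. This is a naive rendering for illustration: it rescans every sequence for each candidate neighbor, and the tie-breaking rule (first minimum found) and the DNA alphabet are assumptions.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def dist_to_seq(u, seq):
    """d(u, S_j): minimum Hamming distance between u and any l-mer of seq."""
    l = len(u)
    return min(hamming(u, seq[i:i + l]) for i in range(len(seq) - l + 1))


def total_dist(u, seqs):
    """d(u, S): sum of d(u, S_j) over all input sequences."""
    return sum(dist_to_seq(u, seq) for seq in seqs)


def best_neighbor(u, seqs, alphabet="ACGT"):
    """The element of D_1(u) with the smallest total distance."""
    best = None
    for i in range(len(u)):
        for c in alphabet:
            if c == u[i]:
                continue
            v = u[:i] + c + u[i + 1:]
            score = total_dist(v, seqs)
            if best is None or score < best[0]:
                best = (score, v)
    return best[1]


def pattern_branching(seqs, l, d, alphabet="ACGT"):
    """Take d BestNeighbor steps from every input l-mer; keep the best endpoint."""
    best = None
    for seq in seqs:
        for i in range(len(seq) - l + 1):
            u = seq[i:i + l]
            for _ in range(d):
                u = best_neighbor(u, seqs, alphabet)
            score = total_dist(u, seqs)
            if best is None or score < best[0]:
                best = (score, u)
    return best[1]
```

For the toy input `["TCGT", "AGGT", "ACTT"]` with l=4 and d=1, each starting l-mer is one substitution away from ACGT, and the walk returns `"ACGT"`.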

12 A Simple Algorithm
1) Form all possible l-mers from the input sequences. Let C be this collection. Let C’ be the collection of l-mers in the first input sequence.
2) For every u in C’, generate all l-mers that are at a Hamming distance of d from u. Let C’’ be the collection of these l-mers. Note that C’’ contains M.
3) For every pair of l-mers (u, v) with u in C and v in C’’, compute the Hamming distance between u and v. Output the l-mer of C’’ that has a neighbor (i.e., an l-mer at a Hamming distance of d) in each one of the n input sequences.

13 A Simple Algorithm Contd… The run time of the above algorithm is $O\left(n m^2 l \binom{l}{d} |\Sigma|^d\right)$.
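The three steps of the simple algorithm can be sketched as below. The sketch follows the slide in requiring a Hamming distance of exactly d, assumes a DNA alphabet, and returns every qualifying candidate as a set rather than a single l-mer; the function names are hypothetical.

```python
from itertools import combinations, product


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def neighbors_at_distance(u, d, alphabet="ACGT"):
    """All strings at Hamming distance exactly d from u."""
    out = set()
    for positions in combinations(range(len(u)), d):
        options = [[c for c in alphabet if c != u[p]] for p in positions]
        for letters in product(*options):
            v = list(u)
            for p, c in zip(positions, letters):
                v[p] = c
            out.add("".join(v))
    return out


def simple_motif_search(seqs, l, d, alphabet="ACGT"):
    """Candidates come from the first sequence (C''), then are checked
    against every l-mer of every sequence (C)."""
    first = seqs[0]
    candidates = set()
    for i in range(len(first) - l + 1):
        candidates |= neighbors_at_distance(first[i:i + l], d, alphabet)
    motifs = set()
    for v in candidates:
        if all(
            any(hamming(v, s[i:i + l]) == d for i in range(len(s) - l + 1))
            for s in seqs
        ):
            motifs.add(v)
    return motifs
```

The nested loops mirror the factors in the run time above: the candidate set contributes the binomial and alphabet terms, and the check against all l-mers contributes the remaining factors.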

14 PMS1
1) Generate all possible l-mers from each of the n input sequences. Let $C_i$ be the collection of l-mers from the i-th sequence.
2) For each $C_i$ and each u in $C_i$, generate all l-mers v such that u and v are at a Hamming distance of d. Let $C_i'$ be the collection of these neighbors of the l-mers in $C_i$.
3) Sort all the l-mers in every $C_i'$. Let $L_i$ be the sorted list corresponding to $C_i'$.
4) Merge all the $L_i$'s and output the l-mer (generated in step 2) that occurs in all the $L_i$'s.

15 PMS1 Contd… The run time of PMS1 is: (Here w is the word length of the computer. Radix sort is used.)
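A compact sketch of the PMS1 steps is given below: build the neighbor collection for each sequence, sort it, and intersect the sorted lists with a linear merge. It is a simplification under stated assumptions. Python's built-in `sorted` stands in for the radix sort mentioned on the slide, duplicates are removed with a set before sorting, and the alphabet is assumed to be DNA.

```python
from itertools import combinations, product


def neighbors_at_distance(u, d, alphabet="ACGT"):
    """All strings at Hamming distance exactly d from u."""
    out = set()
    for positions in combinations(range(len(u)), d):
        options = [[c for c in alphabet if c != u[p]] for p in positions]
        for letters in product(*options):
            v = list(u)
            for p, c in zip(positions, letters):
                v[p] = c
            out.add("".join(v))
    return out


def intersect_sorted(a, b):
    """Merge-style intersection of two sorted, duplicate-free lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def pms1(seqs, l, d, alphabet="ACGT"):
    """Return the l-mers that appear in every sorted neighbor list."""
    common = None
    for s in seqs:
        neighbors = set()
        for i in range(len(s) - l + 1):
            neighbors |= neighbors_at_distance(s[i:i + l], d, alphabet)
        sorted_list = sorted(neighbors)  # stand-in for radix sort
        if common is None:
            common = sorted_list
        else:
            common = intersect_sorted(common, sorted_list)
    return common or []
```

Intersecting as each list is produced keeps only one running list in memory, which is a design choice of this sketch rather than something the slides specify.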

16 PMS2. Note that if M occurs in every input sequence, then every substring of M also occurs in every input sequence. In particular, there are at least l - k + 1 k-mers (for d <= k <= l) such that each of them occurs in every input sequence at a Hamming distance of at most d. Let Q be the collection of k-mers that can be formed out of M. There are l - k + 1 k-mers in Q. Each one of these k-mers will be present in each input sequence at a Hamming distance of at most d.

17 PMS3. This algorithm enables one to handle large values of d. Let d’ = ⌊d/2⌋. Let M be the motif of interest with |M| = l = 2l’ for some integer l’. Let M’ refer to the first half of M and M’’ to the second half. We know that M occurs in every input sequence. Let S be an arbitrary input sequence and let p be the occurrence of M in S. If p’ and p’’ are the two halves of p, then either (1) the Hamming distance between M’ and p’ is at most d’, or (2) the Hamming distance between M’’ and p’’ is at most d’.

18 PMS3 Contd… Also, note that in every input sequence either M’ occurs with a Hamming distance of at most d’ or M’’ occurs with a Hamming distance of at most d’. As a result, either M’ occurs with a Hamming distance of at most d’ in at least n/2 of the sequences, or M’’ does. PMS3 exploits these observations.
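The two PMS3 observations follow from short counting arguments, written out here as a sketch. The notation $d_H$ for Hamming distance and the index sets A and B are introduced only for this illustration, and d’ is read as $\lfloor d/2 \rfloor$.

```latex
% Halving: mismatches split between the two halves of the occurrence p.
\[
d_H(M, p) = d_H(M', p') + d_H(M'', p'') \le d
\;\Longrightarrow\;
\min\{\, d_H(M', p'),\; d_H(M'', p'') \,\} \le \lfloor d/2 \rfloor = d'.
\]
% Pigeonhole: every sequence index lies in A or in B.
\[
A = \{\, j : M' \text{ occurs in } S_j \text{ within } d' \,\},\quad
B = \{\, j : M'' \text{ occurs in } S_j \text{ within } d' \,\},\quad
A \cup B = \{1, \dots, n\}
\;\Longrightarrow\;
\max(|A|, |B|) \ge n/2 .
\]
```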

19 Experimental Data (T is the run time in seconds)
 l   d   T        l   d   T        l   d   T
 9   2   1.44
10   2   0.84
11   2   0.78    11   3   19.84
12   2   0.84    12   3   15.53
13   2   0.70    13   3   20.98    13   4   228.94
14   2   1.05    14   3   20.38    14   4   226.83
15   2   1.33    15   3   20.53    15   4   217.34
16   2   2.61    16   3   21.20    16   4   216.92

20 A Comparison with MITRA. For l=11 and d=2, MITRA takes one minute whereas PMS2 takes around a second. For l=12 and d=3, two versions of MITRA take one minute and four minutes, respectively. PMS2 takes 15.53 seconds. For l=14 and d=4, two versions of MITRA take 4 minutes and 10 minutes, respectively. PMS2 takes 226.83 seconds.

21 Known Algorithms for Problem 2. The algorithm of Sagot [1998] runs in time $O(n^2 m l^d |\Sigma|^d)$ and is based on generalized suffix trees. The space used is $O(n^2 m/w)$, where w is the word length of the computer. This algorithm builds a suffix tree on the given sequences in O(nm) time using O(nm) space. If u is any l-mer present in the input, there are $O(l^d (|\Sigma|-1)^d)$ possible neighbors for u. Any of these neighbors could potentially be a motif of interest. Since there are O(nm) l-mers in the input, the number of such neighbors is $O(nm l^d (|\Sigma|-1)^d)$.

22 Sagot’s Algorithm Contd… For each such neighbor v, the algorithm walks through the tree to check whether v is a possible answer. This walking step is referred to as 'spelling'. The spelling operation takes a total of $O(n^2 m l^d (|\Sigma|-1)^d)$ time using an additional O(nm) space. When employed for solving Problem 2, the same algorithm takes $O(n^2 m l^d |\Sigma|^d)$ time. The algorithm of Adebiyi and Kaufmann [2002] takes an expected $O(nm + d(nm)^{1.9} \log nm)$ time.

23 An Algorithm Similar to PMS1. The basic idea behind the algorithm is as follows. We generate all possible l-mers in the database. There are at most mn such l-mers, and these are the patterns of interest. For each such l-mer we want to determine whether it occurs in at least q of the input sequences. Let u be one of these l-mers. If v is a string such that the edit distance between u and v is at most d, then we say v is a neighbor of u. We generate all the neighbors of u. For each neighbor v of u we determine the list of input sequences in which v is present. These lists (over all possible neighbors of u) are then merged to obtain the list of input sequences in which u occurs (within an edit distance of d).

24 New Algorithm Contd… The above algorithm runs in time $O(n^2 m l^d |\Sigma|^d)$. The space used is $O(nmd + l^d |\Sigma|^d)$, which is less than that of prior algorithms. Only arrays are used in the new algorithm. The underlying constant is small, and hence the algorithm will potentially perform better in practice than Sagot’s algorithm.
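The idea behind the Problem 2 algorithm can be sketched as below. This is a small illustration under assumptions, not the array-based implementation the slides refer to: it uses Python dictionaries and sets, reads "v is present in a sequence" as "v is a substring of it", assumes a DNA alphabet, and generates edit-distance neighbors by applying single edits d times. All names are hypothetical.

```python
from collections import defaultdict


def edit_neighbors(u, d, alphabet="ACGT"):
    """All strings within edit distance d of u (including u itself)."""
    seen = {u}
    frontier = {u}
    for _ in range(d):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])          # insertion
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])          # deletion
                    for c in alphabet:
                        nxt.add(s[:i] + c + s[i + 1:])  # substitution
        frontier = nxt - seen
        seen |= frontier
    return seen


def frequent_patterns(db, l, d, q, alphabet="ACGT"):
    """l-mers of db that occur, within edit distance d, in at least q sequences."""
    # Index every substring whose length an edit-distance neighbor could have.
    index = defaultdict(set)
    for j, seq in enumerate(db):
        for length in range(max(1, l - d), l + d + 1):
            for i in range(len(seq) - length + 1):
                index[seq[i:i + length]].add(j)
    patterns = {seq[i:i + l] for seq in db for i in range(len(seq) - l + 1)}
    result = set()
    for u in patterns:
        hits = set()
        for v in edit_neighbors(u, d, alphabet):
            hits |= index.get(v, set())  # merge the per-neighbor sequence lists
        if len(hits) >= q:
            result.add(u)
    return result
```

For example, with `db = ["ACGTAC", "ACTTAC", "GGGGGG"]`, l=4, d=1, and q=2, the pattern ACGT qualifies because ACTT in the second sequence is one substitution away, while GGGG is found only in the third sequence.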

25 Thank You.

