Exhaustive search (cont’d) CS 466 Saurabh Sinha
Finding motifs ab initio: Enumerate all possible strings of some fixed (small) length. For each such string (“motif”), count its occurrences in the promoters. Report the most frequently occurring motif. Does the true motif pop out? Chapter
Simple statistics: Consider 10 promoters, each 100 bp long. Suppose a secret motif ATGCAACT has been “planted” in each promoter. Our enumerative method counts every possible “8-mer”. The expected number of occurrences of a given 8-mer is $10 \times 100 \times (1/4)^8 \approx 0.015$. Most likely, an arbitrary 8-mer will occur at most once, maybe twice. The 10 occurrences of ATGCAACT will stand out.
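A minimal Python sketch of this counting experiment, assuming uniformly random promoters. The helper names `count_kmers` and `plant_motif` and the fixed seed are illustrative choices, not from the slides. It tallies only the 8-mers that actually occur, which yields the same most frequent string as enumerating all 4^8 candidates.

```python
import random
from collections import Counter

def count_kmers(sequences, k):
    """Count every k-mer occurrence (overlapping windows) across all sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

def plant_motif(length, motif, rng):
    """Random DNA string of the given length with `motif` written at a random position."""
    bases = [rng.choice("ACGT") for _ in range(length)]
    pos = rng.randrange(length - len(motif) + 1)
    bases[pos:pos + len(motif)] = motif  # overwrite len(motif) bases in place
    return "".join(bases)

rng = random.Random(0)  # fixed seed, only so that reruns give the same promoters
promoters = [plant_motif(100, "ATGCAACT", rng) for _ in range(10)]
counts = count_kmers(promoters, 8)

# The slide's back-of-envelope estimate for one arbitrary 8-mer.
expected = 10 * 100 * (1 / 4) ** 8

print(f"expected count of an arbitrary 8-mer: {expected:.4f}")
print("most frequent 8-mers:", counts.most_common(3))
```

The planted string is guaranteed at least 10 occurrences, while the estimate says a typical 8-mer is unlikely to appear even once.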
Variation in binding sites: Motif occurrences will not always be exact copies of the consensus string. The transcription factor can usually tolerate some variability in its binding sites. It is possible that none of the 10 occurrences of our motif ATGCAACT is actually this precise string.
A new motif model: To define a motif, let's say we know where the motif occurrence starts in each sequence. The motif start positions in their sequences can be represented as $s = (s_1, s_2, s_3, \ldots, s_t)$.
Motifs: Profiles and Consensus
  Line up the patterns by their start indexes $s = (s_1, s_2, \ldots, s_t)$.
  Construct the profile matrix with the frequencies of each nucleotide in each column.
  The consensus nucleotide in each position has the highest score in its column.

               a G g t a c T t
               C c A t a c g t
  Alignment    a c g t T A g t
               a c g t C c A t
               C c g t a c g G
               _______________
          A    3 0 1 0 3 1 1 0
  Profile C    2 4 0 0 1 4 0 0
          G    0 1 4 0 0 0 3 1
          T    0 0 0 5 1 0 1 4
               _______________
  Consensus    A C G T A C G T
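The construction on this slide can be sketched in a few lines of Python. The function names `build_profile` and `consensus` are hypothetical, and ties in a column are broken in A, C, G, T order, which is an assumption the slides do not address.

```python
def build_profile(alignment):
    """Return {base: [count in each column]} for equal-length aligned strings."""
    width = len(alignment[0])
    profile = {base: [0] * width for base in "ACGT"}
    for row in alignment:
        for col, base in enumerate(row.upper()):
            profile[base][col] += 1
    return profile

def consensus(profile):
    """Pick the most frequent base in every column (ties go to the earlier base in ACGT)."""
    width = len(profile["A"])
    return "".join(
        max("ACGT", key=lambda b: profile[b][col]) for col in range(width)
    )

# The five aligned 8-mers from the slide.
alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
profile = build_profile(alignment)

for base in "ACGT":
    print(base, profile[base])
print("consensus:", consensus(profile))
```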
Profile matrices: Suppose there were t sequences to begin with, and consider a column of a profile matrix. The column may be (t, 0, 0, 0): a perfectly conserved column. The column may be (t/4, t/4, t/4, t/4): a completely uniform column. “Good” profile matrices should have more conserved columns.
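Scoring Motifs: Given $s = (s_1, \ldots, s_t)$ and DNA, $\mathrm{Score}(s, \mathrm{DNA}) = \sum_{i=1}^{l} \max_{k \in \{A,C,G,T\}} \mathrm{count}(k, i)$, where $\mathrm{count}(k, i)$ is the number of times nucleotide $k$ appears in column $i$ of the alignment ($t$ rows, $l$ columns). For the alignment above, with $t = 5$, $l = 8$ and consensus ACGTACGT, Score = 3 + 4 + 4 + 5 + 3 + 4 + 3 + 4 = 30.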
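As a sketch of how the score could be computed directly from start positions, the function below takes 1-based starts as on the slides. The optional `rows` argument is an assumption added here so that the same function also covers the partial score Score(s, i, DNA) used later. The flanking bases in the example sequences are made up for illustration.

```python
def score(starts, dna, l, rows=None):
    """Sum over columns of the majority-base count, using the first `rows` sequences.

    starts: 1-based start position of the chosen l-mer in each sequence.
    rows:   how many sequences to include (default: all of them).
    """
    if rows is None:
        rows = len(starts)
    lmers = [
        dna[r][starts[r] - 1: starts[r] - 1 + l].upper() for r in range(rows)
    ]
    # zip(*lmers) walks the alignment column by column.
    return sum(max(col.count(b) for b in "ACGT") for col in zip(*lmers))

# The slide's five 8-mers, embedded in slightly longer made-up sequences.
dna = ["ttaGgtacTt", "CcAtacgtgg", "aacgtTAgtc", "acgtCcAt", "gggCcgtacgG"]
starts = (3, 1, 2, 1, 4)

print(score(starts, dna, 8))          # full alignment
print(score(starts, dna, 8, rows=2))  # partial score over the first two rows
```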
Good profile matrices: The goal is to find the starting positions $s = (s_1, \ldots, s_t)$ that maximize Score(s, DNA) of the resulting profile matrix. This is one formulation of the “motif finding problem”.
Another formulation: The Hamming distance between two strings v and w is $d_H(v, w)$ = the number of mismatches between v and w. Given an array of starting positions $s = (s_1, \ldots, s_t)$, we define $d_H(v, s) = \sum_i d_H(v, s_i)$, where $s_i$ stands for the l-mer starting at position $s_i$ of sequence i. Define $\mathrm{TotalDist}(v, \mathrm{DNA}) = \min_s d_H(v, s)$. Computing TotalDist is easy: find the closest string to v in each input sequence.
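A short Python sketch of these two definitions. The names `hamming` and `total_distance` are illustrative. Because the minimum over s decomposes sequence by sequence, the code simply takes the best window in each sequence and adds the results.

```python
def hamming(v, w):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(v, w))

def total_distance(v, dna):
    """TotalDist(v, DNA): sum over sequences of the distance to the closest l-mer."""
    l = len(v)
    return sum(
        min(hamming(v, seq[j:j + l]) for j in range(len(seq) - l + 1))
        for seq in dna
    )

# Toy input: ACG occurs exactly in the first sequence and one mismatch away in the others.
print(total_distance("ACG", ["TTACGTT", "AAGAA", "ACCC"]))
```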
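The median string problem: Find the l-mer v that minimizes TotalDist(v, DNA). This is a double minimization, $\min_v \min_s d_H(v, s)$. It is equivalent to the motif finding problem. Show this. (Hint: for a fixed s, $\mathrm{Score}(s, \mathrm{DNA}) = l \cdot t - \min_v d_H(v, s)$.)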
Naïve time complexity: Motif finding problem: consider every $(s_1, \ldots, s_t)$: $O((n-l+1)^t)$. Median string problem: consider every l-mer: $O(4^l)$. Relatively fast! Common form of both expressions: find a vector of L variables, where each variable can take k values: $O(k^L)$.
An algorithm to enumerate!
  Want to generate all strings in $\{1,2,3,4\}^L$: 11…11, 11…12, 11…13, 11…14, …, 44…44
  Increment the least significant digit, and “carry over” to the next position if necessary.

  NEXTLEAF(a, L, k)
    for i := L to 1
      if a_i < k
        a_i := a_i + 1
        return a
      a_i := 1
    return a

  ALLLEAVES(L, k)
    a := (1,..,1)
    while true
      output a
      a := NEXTLEAF(a, L, k)
      if a = (1,..,1)
        return
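The two routines translate almost line for line into Python. This sketch uses tuples of digits 1..k and a generator in place of "output a"; both choices are assumptions made for convenience.

```python
def next_leaf(a, k):
    """Successor of digit tuple `a` (digits 1..k); wraps to all 1s after the last leaf."""
    a = list(a)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1          # no carry needed, done
            return tuple(a)
        a[i] = 1               # digit overflows: reset and carry to the left
    return tuple(a)

def all_leaves(L, k):
    """Yield every length-L tuple over {1..k} in odometer order."""
    first = (1,) * L
    a = first
    while True:
        yield a
        a = next_leaf(a, k)
        if a == first:         # wrapped around: every leaf has been produced
            return

print(list(all_leaves(2, 2)))
```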
“Search Tree” for enumeration -- Order of steps
Visiting all the vertices in the tree
  Not just the leaves, but the internal nodes also. Why not only the leaves of the tree? We’ll see later.
  PreOrder traversal of a tree:

  PreOrder(node)
    1. Visit node
    2. PreOrder(left child of node)
    3. PreOrder(right child of node)

  How about a non-recursive version?
Visit the Next Vertex
  NextVertex(a, i, L, k)      // a : the array of digits
    if i < L                  // i : prefix length
      a_{i+1} ← 1             // L : max length
      return (a, i+1)         // k : max digit value
    else
      for j ← L to 1
        if a_j < k
          a_j ← a_j + 1
          return (a, j)
    return (a, 0)
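A Python sketch of NextVertex together with a small driver that lists every vertex in pre-order. The driver `preorder` and its choice to start at the root (prefix length 0) are assumptions for illustration. Only the first i digits of `a` are meaningful at any step.

```python
def next_vertex(a, i, L, k):
    """Move to the next vertex in pre-order; returns (digits, prefix_length)."""
    a = list(a)
    if i < L:
        a[i] = 1                    # a_{i+1} in the slides' 1-based notation
        return tuple(a), i + 1      # internal node: go one level deeper
    for j in range(L, 0, -1):       # leaf: find the rightmost digit that can grow
        if a[j - 1] < k:
            a[j - 1] += 1
            return tuple(a), j
    return tuple(a), 0              # traversal finished

def preorder(L, k):
    """Yield the prefix of every vertex, starting from the root (empty prefix)."""
    a, i = (1,) * L, 0
    while True:
        yield a[:i]
        a, i = next_vertex(a, i, L, k)
        if i == 0:
            return

print(list(preorder(2, 2)))
```

For L = 2 and k = 2 this visits 1 + 2 + 4 = 7 vertices, internal nodes included.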
In words: If at an internal node, just go one level deeper. If at a leaf node, go to the next leaf; if this moves to a non-sibling, jump up.
Bypassing What if we wish to skip an entire subtree (at some internal node) during the tree traversal ? BYPASS(a, i, L, k) for j := i to 1 if a j < k a j := a j +1 return (a,j) return (a,0)
Brute Force Solution for the Motif finding problem
  BruteForceMotifSearchAgain(DNA, t, n, l)
    s ← (1, 1, …, 1)
    bestScore ← Score(s, DNA)
    bestMotif ← s
    while forever
      s ← NextLeaf(s, t, n-l+1)
      if Score(s, DNA) > bestScore
        bestScore ← Score(s, DNA)
        bestMotif ← (s_1, s_2, ..., s_t)
      if s = (1, 1, …, 1)
        return bestMotif

  Running time: $O(l (n-l+1)^t)$
Can We Do Better? Some sets of $s = (s_1, s_2, \ldots, s_t)$ may have a weak profile for the first i positions $(s_1, s_2, \ldots, s_i)$. Every row of the alignment may add at most l to the Score. Optimism: suppose all subsequent (t - i) positions $(s_{i+1}, \ldots, s_t)$ add the maximum, $(t - i) \cdot l$, to Score(s, i, DNA). If $\mathrm{Score}(s, i, \mathrm{DNA}) + (t - i) \cdot l < \mathrm{BestScore}$, it makes no sense to search in vertices of the current subtree: use ByPass(). This is the “branch and bound” strategy. It saves us from looking at $(n - l + 1)^{t-i}$ leaves.
Pseudocode for Branch and Bound Motif Search
  BranchAndBoundMotifSearch(DNA, t, n, l)
    s ← (1, …, 1)
    bestScore ← 0
    i ← 1
    while i > 0
      if i < t
        optimisticScore ← Score(s, i, DNA) + (t - i) * l
        if optimisticScore < bestScore
          (s, i) ← Bypass(s, i, t, n-l+1)
        else
          (s, i) ← NextVertex(s, i, t, n-l+1)
      else
        if Score(s, DNA) > bestScore
          bestScore ← Score(s, DNA)
          bestMotif ← (s_1, s_2, s_3, …, s_t)
        (s, i) ← NextVertex(s, i, t, n-l+1)
    return bestMotif
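A runnable Python sketch of the pseudocode above, assuming uppercase sequences of equal length and 1-based start positions. The helper names and the toy input are illustrative. The partial score uses only the first i chosen l-mers, as in Score(s, i, DNA).

```python
def score(s, i, dna, l):
    """Majority-count score of the l-mers chosen in the first i sequences."""
    lmers = [dna[r][s[r] - 1: s[r] - 1 + l] for r in range(i)]
    return sum(max(col.count(b) for b in "ACGT") for col in zip(*lmers))

def next_vertex(a, i, L, k):
    a = list(a)
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def bypass(a, i, L, k):
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def branch_and_bound_motif_search(dna, l):
    """Return (best start positions, best score), pruning hopeless subtrees."""
    t, n = len(dna), len(dna[0])
    k = n - l + 1                      # number of possible start positions
    s, i = [1] * t, 1
    best_score, best_motif = 0, None
    while i > 0:
        if i < t:
            # Upper bound: every remaining row contributes at most l.
            optimistic = score(s, i, dna, l) + (t - i) * l
            if optimistic < best_score:
                s, i = bypass(s, i, t, k)
            else:
                s, i = next_vertex(s, i, t, k)
        else:
            current = score(s, t, dna, l)
            if current > best_score:
                best_score, best_motif = current, tuple(s)
            s, i = next_vertex(s, i, t, k)
    return best_motif, best_score

# Toy input: ACG is the only 3-mer shared by all three sequences.
dna = ["TTACGTT", "ACGTTTT", "TTTTACG"]
print(branch_and_bound_motif_search(dna, 3))
```

Because the bound never underestimates what a subtree can achieve, pruning does not change the best score found, only the amount of work.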
The median string problem: Enumerate $4^l$ strings v. For each v, compute TotalDist(v, DNA); this requires a linear scan of DNA, i.e., O(nt). Overall: $O(nt \cdot 4^l)$. Improvement by branch and bound? During enumeration of l-mers, suppose we are at some prefix v’ and find that TotalDist(v’, DNA) > BestDistanceSoFar. Why enumerate further?
Bounded Median String Search
  BranchAndBoundMedianStringSearch(DNA, t, n, l)
    s ← (1, …, 1)
    bestDistance ← ∞
    i ← 1
    while i > 0
      if i < l
        prefix ← string corresponding to the first i nucleotides of s
        optimisticDistance ← TotalDistance(prefix, DNA)
        if optimisticDistance > bestDistance
          (s, i) ← Bypass(s, i, l, 4)
        else
          (s, i) ← NextVertex(s, i, l, 4)
      else
        word ← nucleotide string corresponding to s
        if TotalDistance(word, DNA) < bestDistance
          bestDistance ← TotalDistance(word, DNA)
          bestWord ← word
        (s, i) ← NextVertex(s, i, l, 4)
    return bestWord
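The same traversal helpers give a runnable sketch of the bounded median string search. Digits 1..4 are mapped to A, C, G, T (an assumed encoding), and the function names and toy input are illustrative. The prune is safe because a prefix can never be farther from the sequences than any word that extends it.

```python
def total_distance(word, dna):
    """Sum over sequences of the Hamming distance to the closest window of len(word)."""
    l = len(word)
    return sum(
        min(sum(a != b for a, b in zip(word, seq[j:j + l]))
            for j in range(len(seq) - l + 1))
        for seq in dna
    )

def next_vertex(a, i, L, k):
    a = list(a)
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def bypass(a, i, L, k):
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def to_word(digits):
    """Map digits 1..4 to nucleotides A, C, G, T."""
    return "".join("ACGT"[d - 1] for d in digits)

def branch_and_bound_median_string(dna, l):
    """Return (best word, its total distance), pruning prefixes that are already too far."""
    s, i = [1] * l, 1
    best_distance, best_word = float("inf"), None
    while i > 0:
        if i < l:
            prefix = to_word(s[:i])
            if total_distance(prefix, dna) > best_distance:
                s, i = bypass(s, i, l, 4)
            else:
                s, i = next_vertex(s, i, l, 4)
        else:
            word = to_word(s)
            distance = total_distance(word, dna)
            if distance < best_distance:
                best_distance, best_word = distance, word
            s, i = next_vertex(s, i, l, 4)
    return best_word, best_distance

dna = ["TTACGTT", "ACGTTTT", "TTTTACG"]
print(branch_and_bound_median_string(dna, 3))
```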
Greedy Algorithms
A greedy approach to the motif finding problem: Given t sequences of length n each, find a profile matrix of length l. The enumerative approach is $O(l n^t)$: impractical. Instead, consider a more practical algorithm called “GREEDYMOTIFSEARCH”. Chapter 5.5
Greedy Motif Search: Find the two closest l-mers in sequences 1 and 2 and form a 2 x l alignment matrix with Score(s, 2, DNA). At each of the following t-2 iterations, find a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences. In other words, find an l-mer in sequence i maximizing Score(s, i, DNA) under the assumption that the first (i-1) l-mers have already been chosen. This sacrifices the optimal solution for speed; in fact, the bulk of the time is spent locating the first 2 l-mers.
Greedy Motif Search pseudocode
  GREEDYMOTIFSEARCH(DNA, t, n, l)
    bestMotif := (1,…,1)
    s := (1,…,1)
    for s_1 = 1 to n-l+1
      for s_2 = 1 to n-l+1
        if (Score(s,2,DNA) > Score(bestMotif,2,DNA))
          bestMotif_1 := s_1
          bestMotif_2 := s_2
    s_1 := bestMotif_1; s_2 := bestMotif_2
    for i = 3 to t
      for s_i = 1 to n-l+1
        if (Score(s,i,DNA) > Score(bestMotif,i,DNA))
          bestMotif_i := s_i
      s_i := bestMotif_i
    return bestMotif
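A direct Python transcription of the pseudocode, assuming uppercase sequences of equal length. Start positions stay 1-based as on the slide, while row indices are 0-based inside the code. The function names and the toy input are illustrative.

```python
def score(s, i, dna, l):
    """Majority-count score of the l-mers chosen in the first i sequences."""
    lmers = [dna[r][s[r] - 1: s[r] - 1 + l] for r in range(i)]
    return sum(max(col.count(b) for b in "ACGT") for col in zip(*lmers))

def greedy_motif_search(dna, l):
    """Fix the best pair in sequences 1 and 2, then extend one sequence at a time."""
    t, n = len(dna), len(dna[0])
    k = n - l + 1                       # number of possible start positions
    best = [1] * t
    s = [1] * t

    # Exhaustive search over all pairs of starts in the first two sequences.
    for s1 in range(1, k + 1):
        for s2 in range(1, k + 1):
            s[0], s[1] = s1, s2
            if score(s, 2, dna, l) > score(best, 2, dna, l):
                best[0], best[1] = s1, s2
    s[0], s[1] = best[0], best[1]

    # Greedy extension: rows 3..t on the slide are indices 2..t-1 here.
    for row in range(2, t):
        for start in range(1, k + 1):
            s[row] = start
            if score(s, row + 1, dna, l) > score(best, row + 1, dna, l):
                best[row] = start
        s[row] = best[row]
    return tuple(best)

dna = ["TTACGTT", "ACGTTTT", "TTTTACG"]
print(greedy_motif_search(dna, 3))
```

On this small input the greedy result coincides with the exhaustive optimum; in general it need not, since early choices are never revisited.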
A digression: The score of a profile matrix looks only at the “majority” base in each column, not at the entire distribution. There is also the issue of non-uniform “background” frequencies of bases in the genome. A better “score” of a profile matrix?
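Information Content: First convert a “profile matrix” to a “position weight matrix” or PWM, i.e., convert frequencies to probabilities. PWM W: $W_{\beta k}$ = frequency of base $\beta$ at position k; $q_\beta$ = frequency of base $\beta$ by chance. Information content of W: $\sum_k \sum_{\beta \in \{A,C,G,T\}} W_{\beta k} \log \frac{W_{\beta k}}{q_\beta}$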
Information Content: If $W_{\beta k}$ equals $q_\beta$ for every base $\beta$ and position k, i.e., if W is similar to random sequence, the information content of W is 0. If W is different from q, the information content is high.
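A small Python sketch of the information content of a count profile, under two assumptions that the slides leave open: logarithms are taken base 2 (so the result is in bits), and a zero frequency contributes nothing to the sum. The function name and the uniform default background are illustrative.

```python
import math

def information_content(profile, background=None):
    """Sum over positions k and bases b of W[b][k] * log2(W[b][k] / q[b]).

    profile:    {base: [count in each column]}, converted to probabilities per column.
    background: {base: chance frequency}; uniform 0.25 each if omitted (an assumption).
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}
    width = len(profile["A"])
    total = 0.0
    for k in range(width):
        column_sum = sum(profile[b][k] for b in "ACGT")
        for b in "ACGT":
            w = profile[b][k] / column_sum
            if w > 0:                       # treat 0 * log(0) as 0
                total += w * math.log2(w / background[b])
    return total

conserved = {"A": [4], "C": [0], "G": [0], "T": [0]}
uniform = {"A": [1], "C": [1], "G": [1], "T": [1]}
print(information_content(conserved))  # perfectly conserved column
print(information_content(uniform))    # column matching the background
```

With a uniform background, a perfectly conserved column contributes 2 bits and a column that matches the background contributes 0.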
Greedy Motif Search: Can be trivially modified to use “Information Content” as the score. Use statistical criteria to evaluate the significance of the Information Content. At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs: “beam search”. The program “CONSENSUS” from the Stormo lab. Further Reading: Hertz, Hartzell & Stormo, CABIOS (1990)
More on Greedy algorithms in next lecture