Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string.

Exhaustive search (cont’d) CS 466 Saurabh Sinha

Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string (“motif”) count its occurrences in the promoters Report the most frequently occurring motif Does the true motif pop out ? Chapter 4.5-4.9

Simple statistics Consider 10 promoters, each 100 bp long Suppose a secret motif ATGCAACT has been “planted” in each promoter Our enumerative method counts every possible “8-mer” Expected number of occurrences of an 8-mer is 10 x 100 x (1/4) 8 ≈ 0.015 Most likely, an arbitrary 8-mer will occur at most once, may be twice 10 occurrences of ATGCAACT will stand out

Variation in binding sites Motif occurrences will not always be exact copies of the consensus string The transcription factor can usually tolerate some variability in its binding sites It’s possible that none of the 10 occurrences of our motif ATGCAACT is actualy this precise string

A new motif model To define a motif, lets say we know where the motif occurrence starts in the sequence The motif start positions in their sequences can be represented as s = (s 1,s 2,s 3,…,s t )

Motifs: Profiles and Consensus a G g t a c T t C c A t a c g t Alignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0 Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________ Consensus A C G T A C G T Line up the patterns by their start indexes s = (s 1, s 2, …, s t ) Construct matrix profile with frequencies of each nucleotide in columns Consensus nucleotide in each position has the highest score in column

Profile matrices Suppose there were t sequences to begin with Consider a column of a profile matrix The column may be (t, 0, 0, 0) –A perfectly conserved column The column may be (t/4, t/4, t/4, t/4) –A completely uniform column “Good” profile matrices should have more conserved columns

Scoring Motifs Given s = (s 1, … s t ) and DNA Score(s,DNA) = a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________ Consensus a c g t a c g t Score 3+4+4+5+3+4+3+4= 30 l t

Good profile matrices Goal is to find the starting positions s=(s 1,…s t ) to maximize the score(s,DNA) of the resulting profile matrix This is one formulation of the “motif finding problem”

Another formulation Hamming distance between two strings v and w is d H (v,w) = number of mismatches between v and w Given an array of starting positions s=(s 1,…s t ), we define d H (v, s) = ∑ i d H (v,s i ) Define: TotalDist(v, DNA) = min s d H (v,s) Computing TotalDist is easy –find closest string to v in each input sequence

The median string problem Find v that minimizes TotalDist(v) A double minimization (min s, min v ) Equivalent to motif finding problem –Show this

Naïve time complexity Motif finding problem: Consider every (s 1,…s t ): O((n-l+1) t ) Median string problem: Consider every l-mer: O(4 l ). Relatively fast ! Common form of both expressions: Find a vector of L variables, each variable can take k values: O(k L )

An algorithm to enumerate ! Want to generate all strings in {1,2,3,4} L 11…11 11…12 11…13 11…14. 44…44 NEXTLEAF(a, L, k) for i := L to 1 if a i < k a i := a i +1 return a a i := 1 return a Increment the least significant digit; and “carry over” to next position if necessary ALLLEAVES(L, k) a := (1,..,1) while true output a a := NEXTLEAF(a,L,k) if a = (1,..,1) return

“Seach Tree” for enumeration -- Order of steps 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 1- 2- 3- 4- 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 23 4 1 23 4

Visiting all the vertices in tree Not just the leaves, but the internal nodes also –Why? Why not only the leaves of the tree? –We’ll see later. PreOrder traversal of a tree PreOrder(node): 1.Visit node 2.PreOrder(left child of node) 3.PreOrder(right child of node) How about a non- “recursive” version?

Visit the Next Vertex 1.NextVertex(a,i,L,k) // a : the array of digits 2. if i < L // i : prefix length 3. a i+1  1 // L: max length 4. return ( a,i+1) // k : max digit value 5. else 6. for j  L to 1 7. if a j < k 8. a j  a j +1 9. return( a,j ) 10. return(a,0)

In words If at an internal node, just go one level deeper. If at a leaf node, –go to next leaf –if moved to a non-sibling in this process, jump up

Bypassing What if we wish to skip an entire subtree (at some internal node) during the tree traversal ? BYPASS(a, i, L, k) for j := i to 1 if a j < k a j := a j +1 return (a,j) return (a,0)

Brute Force Solution for the Motif finding problem 1. BruteForceMotifSearchAgain(DNA, t, n, l) 2.s  (1,1,…, 1) 3.bestScore  Score(s,DNA) 4.while forever 5.s  NextLeaf (s, t, n-l+1) 6.if (Score(s,DNA) > bestScore) 7.bestScore  Score(s, DNA) 8.bestMotif  (s 1,s 2,..., s t ) 9.return bestMotif O(l(n-l+1) t )

Can We Do Better? Sets of s=(s 1, s 2, …,s t ) may have a weak profile for the first i positions (s 1, s 2, …,s i ) Every row of alignment may add at most l to Score Optimism: if all subsequent (t-i) positions (s i+1, …s t ) add (t – i ) * l to Score(s,i,DNA) If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree –Use ByPass() “Branch and bound” strategy –This saves us from looking at (n – l + 1) t-i leaves

Pseudocode for Branch and Bound Motif Search 1.BranchAndBoundMotifSearch(DNA,t,n, l ) 2.s  (1,…,1) 3.bestScore  0 4.i  1 5.while i > 0 6.if i < t 7.optimisticScore  Score(s, i, DNA) +(t – i ) * l 8.if optimisticScore < bestScore 9. (s, i)  Bypass(s,i, n- l +1) 10.else 11. (s, i)  NextVertex(s, i, n- l +1) 12.else 13.if Score(s,DNA) > bestScore 14. bestScore  Score(s) 15. bestMotif  (s 1, s 2, s 3, …, s t ) 16. (s,i)  NextVertex(s,i,t,n- l + 1) 17.return bestMotif

The median string problem Enumerate 4 l strings v For each v, compute TotalDist(v, DNA) –This requires linear scan of DNA, i.e., O(nt) Overall: O(nt4 l ) Improvement by branch and bound ? During enumeration of l-mers, suppose we are at some prefix v’, and find that TotalDist(v’,DNA) > BestDistanceSoFar. Why enumerate further ?

Bounded Median String Search 1.BranchAndBoundMedianStringSearch(DNA,t,n, l ) 2.s  (1,…,1) 3.bestDistance  ∞ 4. i  1 5.while i > 0 6. if i < l 7. prefix  string corresponding to the first i nucleotides of s 8. optimisticDistance  TotalDistance(prefix,DNA) 9. if optimisticDistance > bestDistance 10. (s, i )  Bypass(s,i, l, 4) 11. else 12. (s, i )  NextVertex(s, i, l, 4) 13. else 14. word  nucleotide string corresponding to s 15. if TotalDistance(s,DNA) < bestDistance 16. bestDistance  TotalDistance(word, DNA) 17. bestWord  word 18. (s,i )  NextVertex(s,i, l, 4) 19.return bestWord

Greedy Algorithms

A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix of length l. Enumerative approach O(l n t ) –Impractical Instead consider a more practical algorithm called “GREEDYMOTIFSEARCH” Chapter 5.5

Greedy Motif Search Find two closest l-mers in sequences 1 and 2 and form 2 x l alignment matrix with Score(s,2,DNA) At each of the following t-2 iterations, finds a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA) under the assumption that the first (i-1) l-mers have been already chosen Sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

Greedy Motif Search pseudocode GREEDYMOTIFSEARCH (DNA, t, n, l) bestMotif := (1,…,1) s := (1,…,1) for s 1 =1 to n-l+1 for s 2 = 1 to n-l+1 if (Score(s,2,DNA) > Score(bestMotif,2,DNA) bestMotif 1 := s 1 bestMotif 2 := s 2 s 1 := bestMotif 1 ; s 2 := bestMotif 2 for i = 3 to t for s i = 1 to n-l+1 if (Score(s,i,DNA) > Score(bestMotif,i,DNA) bestMotif i := s i s i := bestMotif i Return bestMotif

A digression Score of a profile matrix looks only at the “majority” base in each column, not at the entire distribution The issue of non-uniform “background” frequencies of bases in the genome A better “score” of a profile matrix ?

Information Content First convert a “profile matrix” to a “position weight matrix” or PWM –Convert frequencies to probabilities PWM W: W  k = frequency of base  at position k q  = frequency of base  by chance Information content of W:

Information Content If W  k is always equal to q , i.e., if W is similar to random sequence, information content of W is 0. If W is different from q, information content is high.

Greedy Motif Search Can be trivially modified to use “Information Content” as the score Use statistical criteria to evaluate significance of Information Content At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs –“Beam search” The program “CONSENSUS” from Stormo lab. Further Reading: Hertz, Hartzell & Stormo, CABIOS (1990) http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/hertz90.pdf

More on Greedy algorithms in next lecture

Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string.

Similar presentations

Presentation on theme: "Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string.

Similar presentations

Presentation on theme: "Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string."— Presentation transcript:

Similar presentations

About project

Feedback