Exhaustive search (cont’d) CS 466 Saurabh Sinha


Finding motifs ab initio
Enumerate all possible strings of some fixed (small) length
For each such string (“motif”), count its occurrences in the promoters
Report the most frequently occurring motif
Does the true motif pop out?
Chapter
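The enumerate-and-count idea can be sketched as follows. This is a minimal illustration, not the course's implementation: it counts only the k-mers that actually occur (which finds the same most frequent string as enumerating all 4^k candidates), and the function names and toy promoters are made up for the example.

```python
from collections import Counter

def count_kmers(promoters, k):
    """Count every length-k substring (overlapping) across all promoters."""
    counts = Counter()
    for seq in promoters:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

def most_frequent_kmer(promoters, k):
    """Return (kmer, count) for the most frequent k-mer; ties broken arbitrarily."""
    return count_kmers(promoters, k).most_common(1)[0]

# Toy promoters (hypothetical data) with ATGCAACT planted in each one.
toy = ["CCATGCAACTGG", "TTTATGCAACTA", "ATGCAACTCGCG"]
print(most_frequent_kmer(toy, 8))
```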

Simple statistics
Consider 10 promoters, each 100 bp long
Suppose a secret motif ATGCAACT has been “planted” in each promoter
Our enumerative method counts every possible “8-mer”
Expected number of occurrences of an 8-mer is 10 x 100 x (1/4)^8 ≈ 0.015
Most likely, an arbitrary 8-mer will occur at most once, maybe twice
The 10 occurrences of ATGCAACT will stand out
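The expected count on this slide can be checked with a few lines of arithmetic. The sketch assumes uniform, independent bases (the (1/4)^8 term), and the second figure is an added refinement that counts only the start positions that fit inside each promoter.

```python
promoters = 10
length = 100
k = 8

# The slide's approximation: about promoters * length start positions,
# each matching a fixed 8-mer with probability (1/4)^k.
expected_approx = promoters * length * 0.25 ** k

# Refinement (not on the slide): only length - k + 1 starts fit per promoter.
expected_exact = promoters * (length - k + 1) * 0.25 ** k

print(round(expected_approx, 4), round(expected_exact, 4))
```

Either way the expectation is far below 1, which is why 10 occurrences of the planted motif would stand out.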

Variation in binding sites
Motif occurrences will not always be exact copies of the consensus string
The transcription factor can usually tolerate some variability in its binding sites
It’s possible that none of the 10 occurrences of our motif ATGCAACT is actually this precise string

A new motif model
To define a motif, let's say we know where the motif occurrence starts in each sequence
The motif start positions in their sequences can be represented as s = (s_1, s_2, s_3, …, s_t)

Motifs: Profiles and Consensus
Alignment   a G g t a c T t
            C c A t a c g t
            a c g t T A g t
            a c g t C c A t
            C c g t a c g G
            _______________
Profile   A 3 0 1 0 3 1 1 0
          C 2 4 0 0 1 4 0 0
          G 0 1 4 0 0 0 3 1
          T 0 0 0 5 1 0 1 4
            _______________
Consensus   A C G T A C G T
Line up the patterns by their start indexes s = (s_1, s_2, …, s_t)
Construct the profile matrix with the frequencies of each nucleotide in each column
The consensus nucleotide in each position has the highest score in its column
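Building the profile and consensus from an alignment can be sketched like this. The helper names and the 0-based start positions are assumptions made for the example; the five rows are the alignment shown on the slide.

```python
ALPHABET = "ACGT"

def build_profile(dna, starts, l):
    """Count each base in each column of the alignment picked out by starts (0-based)."""
    profile = {base: [0] * l for base in ALPHABET}
    for seq, s in zip(dna, starts):
        for col, base in enumerate(seq[s:s + l].upper()):
            profile[base][col] += 1
    return profile

def consensus(profile):
    """Most frequent base per column; ties go to the earlier base in ALPHABET."""
    l = len(profile["A"])
    return "".join(max(ALPHABET, key=lambda b: profile[b][col]) for col in range(l))

rows = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
profile = build_profile(rows, [0] * 5, 8)
for base in ALPHABET:
    print(base, profile[base])
print(consensus(profile))
```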

Profile matrices
Suppose there were t sequences to begin with
Consider a column of a profile matrix
The column may be (t, 0, 0, 0)
  - A perfectly conserved column
The column may be (t/4, t/4, t/4, t/4)
  - A completely uniform column
“Good” profile matrices should have more conserved columns

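Scoring Motifs
Given s = (s_1, …, s_t) and DNA:
Score(s, DNA) = ∑_{i=1}^{l} max_{k ∈ {A,C,G,T}} count(k, i)
The alignment has t rows and l columns:
            a G g t a c T t
            C c A t a c g t
            a c g t T A g t
            a c g t C c A t
            C c g t a c g G
            _______________
          A 3 0 1 0 3 1 1 0
          C 2 4 0 0 1 4 0 0
          G 0 1 4 0 0 0 3 1
          T 0 0 0 5 1 0 1 4
            _______________
Consensus   a c g t a c g t
Score = 3+4+4+5+3+4+3+4 = 30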
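A minimal sketch of the score as the sum, over columns, of the count of the most frequent base. The `rows` parameter is an assumption added here so the same function can also compute the partial score Score(s, i, DNA) used later in the lecture; start positions are 0-based.

```python
def score(dna, starts, l, rows=None):
    """Sum over columns of the count of the most frequent base.

    Only the first `rows` sequences are used (the partial score);
    rows=None means all of them.
    """
    rows = len(dna) if rows is None else rows
    total = 0
    for col in range(l):
        column = [dna[r][starts[r] + col].upper() for r in range(rows)]
        total += max(column.count(base) for base in "ACGT")
    return total

alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
print(score(alignment, [0] * 5, 8))          # full score of the slide's alignment
print(score(alignment, [0] * 5, 8, rows=2))  # partial score of the first two rows
```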

Good profile matrices
Goal is to find the starting positions s = (s_1, …, s_t) that maximize Score(s, DNA) of the resulting profile matrix
This is one formulation of the “motif finding problem”

Another formulation
Hamming distance between two strings v and w is d_H(v, w) = number of mismatches between v and w
Given an array of starting positions s = (s_1, …, s_t), we define d_H(v, s) = ∑_i d_H(v, s_i), where s_i stands for the l-mer starting at position s_i of sequence i
Define: TotalDist(v, DNA) = min_s d_H(v, s)
Computing TotalDist is easy
  - Find the closest string to v in each input sequence
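The definition translates almost directly into code: for each sequence take the closest l-mer, then add up the distances. The function names and the toy sequences below are illustrative assumptions.

```python
def hamming(v, w):
    """Number of mismatching positions between equal-length strings."""
    return sum(a != b for a, b in zip(v, w))

def total_distance(v, dna):
    """Sum over sequences of the minimum Hamming distance from v to any l-mer."""
    l = len(v)
    total = 0
    for seq in dna:
        total += min(hamming(v, seq[s:s + l]) for s in range(len(seq) - l + 1))
    return total

# Hypothetical sequences: ACGT occurs exactly in the first and third,
# and the closest 4-mer in the second (ACCT) has one mismatch.
dna = ["TTACGTAA", "CCACCTGG", "ACGTACGT"]
print(total_distance("ACGT", dna))
```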

The median string problem
Find v that minimizes TotalDist(v, DNA)
A double minimization (min_s, min_v)
Equivalent to the motif finding problem
  - Show this
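One way to sketch the "show this" step, assuming Score counts the most frequent base in each of the l columns of the t-row alignment: for fixed start positions s, the best v is the consensus, and each column then contributes t minus its majority count to the distance. Swapping the two minimizations gives the equivalence.

```latex
\begin{align*}
\min_{v} d_H(v, s)
  &= \sum_{j=1}^{l} \Bigl(t - \max_{k \in \{A,C,G,T\}} \mathrm{count}(k, j)\Bigr)
   = l\,t - \mathrm{Score}(s, \mathrm{DNA}) \\
\min_{v} \mathrm{TotalDist}(v, \mathrm{DNA})
  &= \min_{v} \min_{s} d_H(v, s)
   = \min_{s} \min_{v} d_H(v, s)
   = l\,t - \max_{s} \mathrm{Score}(s, \mathrm{DNA})
\end{align*}
```

So the start positions that maximize the score are the ones whose consensus is a median string, and vice versa.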

Naïve time complexity
Motif finding problem: consider every (s_1, …, s_t): O((n-l+1)^t)
Median string problem: consider every l-mer: O(4^l). Relatively fast!
Common form of both expressions: find a vector of L variables, where each variable can take k values: O(k^L)

An algorithm to enumerate!
Want to generate all strings in {1,2,3,4}^L
11…11, 11…12, 11…13, 11…14, ..., 44…44
NEXTLEAF(a, L, k)
  for i := L to 1
    if a_i < k
      a_i := a_i + 1
      return a
    a_i := 1
  return a
Increment the least significant digit, and “carry over” to the next position if necessary
ALLLEAVES(L, k)
  a := (1,..,1)
  while true
    output a
    a := NEXTLEAF(a, L, k)
    if a = (1,..,1)
      return
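A runnable sketch of NEXTLEAF and ALLLEAVES. It follows the pseudocode but uses tuples and a generator, which are choices made for this example; L is taken from the length of the tuple.

```python
def next_leaf(a, k):
    """Return the successor of digit tuple a, counting over digits 1..k.

    Wraps around to (1, ..., 1) after (k, ..., k).
    """
    a = list(a)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1
            return tuple(a)
        a[i] = 1  # carry into the next more significant digit
    return tuple(a)

def all_leaves(L, k):
    """Yield every tuple in {1..k}^L in lexicographic order."""
    a = (1,) * L
    while True:
        yield a
        a = next_leaf(a, k)
        if a == (1,) * L:
            return

print(list(all_leaves(2, 2)))
```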

“Search Tree” for enumeration -- Order of steps

Visiting all the vertices in tree
Not just the leaves, but the internal nodes also
  - Why? Why not only the leaves of the tree?
  - We’ll see later.
PreOrder traversal of a tree
PreOrder(node):
  1. Visit node
  2. PreOrder(left child of node)
  3. PreOrder(right child of node)
How about a non-recursive version?

Visit the Next Vertex
NextVertex(a, i, L, k)   // a: the array of digits, i: prefix length, L: max length, k: max digit value
  if i < L
    a_{i+1} ← 1
    return (a, i+1)
  else
    for j ← L to 1
      if a_j < k
        a_j ← a_j + 1
        return (a, j)
  return (a, 0)
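A sketch of NextVertex with a small driver that lists the vertices in visiting order. The `preorder` helper is an addition for illustration; indices are shifted by one because Python lists are 0-based.

```python
def next_vertex(a, i, L, k):
    """Advance a preorder traversal of the k-ary tree of depth L.

    a holds digits 1..k; only the first i entries are meaningful (the
    current prefix). Returns (a, i) for the next vertex, or (a, 0) when
    the traversal is finished.
    """
    a = list(a)
    if i < L:
        a[i] = 1  # the pseudocode's a_{i+1}
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def preorder(L, k):
    """List every non-root vertex as a prefix tuple, in visiting order."""
    a, i = [1] * L, 1
    visited = []
    while i > 0:
        visited.append(tuple(a[:i]))
        a, i = next_vertex(a, i, L, k)
    return visited

print(preorder(2, 2))
```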

In words
If at an internal node, just go one level deeper.
If at a leaf node,
  - go to the next leaf
  - if moved to a non-sibling in this process, jump up

Bypassing
What if we wish to skip an entire subtree (at some internal node) during the tree traversal?
BYPASS(a, i, L, k)
  for j := i to 1
    if a_j < k
      a_j := a_j + 1
      return (a, j)
  return (a, 0)
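BYPASS in runnable form, as a sketch. The parameter L is not needed by the logic and is kept only to mirror the pseudocode's signature.

```python
def bypass(a, i, L, k):
    """Skip the subtree below the length-i prefix of a.

    Moves to the next vertex at depth i or shallower; returns (a, 0)
    when nothing is left to visit.
    """
    a = list(a)
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

# From prefix (1,) skip everything below it and land on prefix (2,).
print(bypass([1, 1, 1], 1, 3, 4))
```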

Brute Force Solution for the Motif finding problem
BruteForceMotifSearchAgain(DNA, t, n, l)
  s ← (1, 1, …, 1)
  bestScore ← Score(s, DNA)
  bestMotif ← s
  while forever
    s ← NextLeaf(s, t, n-l+1)
    if Score(s, DNA) > bestScore
      bestScore ← Score(s, DNA)
      bestMotif ← (s_1, s_2, ..., s_t)
    if s = (1, 1, …, 1)
      return bestMotif
Running time: O(l (n-l+1)^t)
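A brute-force sketch for small inputs. It swaps NextLeaf for `itertools.product`, which enumerates the same (n-l+1)^t tuples of start positions; it also assumes uppercase sequences of equal length and 0-based positions, and the toy sequences are made up.

```python
from itertools import product

def score(dna, starts, l):
    """Sum over columns of the count of the most frequent base."""
    total = 0
    for col in range(l):
        column = [seq[s + col] for seq, s in zip(dna, starts)]
        total += max(column.count(base) for base in "ACGT")
    return total

def brute_force_motif_search(dna, l):
    """Score every tuple of start positions and keep the best one."""
    n = len(dna[0])
    best_score, best_starts = -1, None
    for starts in product(range(n - l + 1), repeat=len(dna)):
        current = score(dna, starts, l)
        if current > best_score:
            best_score, best_starts = current, starts
    return best_starts, best_score

# Hypothetical input with ACGT planted at positions 2, 1 and 4.
toy = ["TTACGTAA", "CACGTCCC", "GGGGACGT"]
print(brute_force_motif_search(toy, 4))
```

Even this toy case scores 5^3 = 125 tuples; the count grows exponentially in the number of sequences t.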

Can We Do Better?
Sets of s = (s_1, s_2, …, s_t) may have a weak profile for the first i positions (s_1, s_2, …, s_i)
Every row of the alignment may add at most l to Score
Optimism: suppose all subsequent (t-i) positions (s_{i+1}, …, s_t) add the maximum, (t-i) * l in total, to Score(s, i, DNA)
If Score(s, i, DNA) + (t-i) * l < BestScore, it makes no sense to search in vertices of the current subtree
  - Use ByPass()
“Branch and bound” strategy
  - This saves us from looking at (n-l+1)^(t-i) leaves

Pseudocode for Branch and Bound Motif Search
BranchAndBoundMotifSearch(DNA, t, n, l)
  s ← (1, …, 1)
  bestScore ← 0
  i ← 1
  while i > 0
    if i < t
      optimisticScore ← Score(s, i, DNA) + (t-i) * l
      if optimisticScore < bestScore
        (s, i) ← Bypass(s, i, t, n-l+1)
      else
        (s, i) ← NextVertex(s, i, t, n-l+1)
    else
      if Score(s, DNA) > bestScore
        bestScore ← Score(s, DNA)
        bestMotif ← (s_1, s_2, s_3, …, s_t)
      (s, i) ← NextVertex(s, i, t, n-l+1)
  return bestMotif
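The branch-and-bound search can be sketched in Python as below. Start positions are 1-based to mirror the pseudocode; the helper names, the dropped unused argument of `bypass`, and the toy input are assumptions made for the example, and sequences are assumed uppercase and of equal length.

```python
def partial_score(dna, s, i, l):
    """Score(s, i, DNA): consensus score of the first i rows (s is 1-based)."""
    total = 0
    for col in range(l):
        column = [dna[r][s[r] - 1 + col] for r in range(i)]
        total += max(column.count(base) for base in "ACGT")
    return total

def next_vertex(a, i, L, k):
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def bypass(a, i, k):
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def branch_and_bound_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    k = n - l + 1  # number of possible start positions per sequence
    s, i = [1] * t, 1
    best_score, best_motif = 0, None
    while i > 0:
        if i < t:
            # Upper bound: every remaining row adds at most l to the score.
            optimistic = partial_score(dna, s, i, l) + (t - i) * l
            if optimistic < best_score:
                s, i = bypass(s, i, k)
            else:
                s, i = next_vertex(s, i, t, k)
        else:
            current = partial_score(dna, s, t, l)
            if current > best_score:
                best_score, best_motif = current, tuple(s)
            s, i = next_vertex(s, i, t, k)
    return best_motif, best_score

toy = ["TTACGTAA", "CACGTCCC", "GGGGACGT"]
print(branch_and_bound_motif_search(toy, 4))
```

The bound is safe because adding a row can raise each column's majority count by at most one, so a pruned subtree cannot contain a better leaf.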

The median string problem
Enumerate 4^l strings v
For each v, compute TotalDist(v, DNA)
  - This requires a linear scan of DNA, i.e., O(nt)
Overall: O(nt 4^l)
Improvement by branch and bound?
During enumeration of l-mers, suppose we are at some prefix v’, and find that TotalDist(v’, DNA) > BestDistanceSoFar. Why enumerate further?
  - Extending v’ can only add mismatches, so no l-mer with this prefix can beat BestDistanceSoFar

Bounded Median String Search
BranchAndBoundMedianStringSearch(DNA, t, n, l)
  s ← (1, …, 1)
  bestDistance ← ∞
  i ← 1
  while i > 0
    if i < l
      prefix ← string corresponding to the first i nucleotides of s
      optimisticDistance ← TotalDistance(prefix, DNA)
      if optimisticDistance > bestDistance
        (s, i) ← Bypass(s, i, l, 4)
      else
        (s, i) ← NextVertex(s, i, l, 4)
    else
      word ← nucleotide string corresponding to s
      if TotalDistance(word, DNA) < bestDistance
        bestDistance ← TotalDistance(word, DNA)
        bestWord ← word
      (s, i) ← NextVertex(s, i, l, 4)
  return bestWord
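A sketch of the bounded median string search. Digits 1..4 are mapped to A, C, G, T, which is an assumed encoding; the helper names and toy input are illustrative, and `bypass` drops the pseudocode's unused length argument.

```python
def hamming(v, w):
    return sum(a != b for a, b in zip(v, w))

def total_distance(v, dna):
    """Sum over sequences of the distance from v to its closest len(v)-mer."""
    l = len(v)
    return sum(
        min(hamming(v, seq[s:s + l]) for s in range(len(seq) - l + 1))
        for seq in dna
    )

def next_vertex(a, i, L, k):
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def bypass(a, i, k):
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def branch_and_bound_median_string(dna, l):
    s, i = [1] * l, 1
    best_distance, best_word = float("inf"), None
    while i > 0:
        word = "".join("ACGT"[d - 1] for d in s[:i])  # current prefix or full l-mer
        distance = total_distance(word, dna)
        if i < l:
            # A prefix's distance is a lower bound for all of its extensions.
            if distance > best_distance:
                s, i = bypass(s, i, 4)
            else:
                s, i = next_vertex(s, i, l, 4)
        else:
            if distance < best_distance:
                best_distance, best_word = distance, word
            s, i = next_vertex(s, i, l, 4)
    return best_word, best_distance

toy = ["TTACGTAA", "CACGTCCC", "GGGGACGT"]
print(branch_and_bound_median_string(toy, 4))
```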

Greedy Algorithms

A greedy approach to the motif finding problem
Given t sequences of length n each, find a profile matrix of length l.
Enumerative approach: O(l n^t)
  - Impractical
Instead consider a more practical algorithm called “GREEDYMOTIFSEARCH”
Chapter 5.5

Greedy Motif Search
Find the two closest l-mers in sequences 1 and 2 and form a 2 x l alignment matrix with Score(s, 2, DNA)
At each of the following t-2 iterations, find a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences
In other words, find an l-mer in sequence i maximizing Score(s, i, DNA) under the assumption that the first (i-1) l-mers have already been chosen
Sacrifices the optimal solution for speed: in fact, the bulk of the time is spent locating the first 2 l-mers

Greedy Motif Search pseudocode
GREEDYMOTIFSEARCH(DNA, t, n, l)
  bestMotif := (1, …, 1)
  s := (1, …, 1)
  for s_1 = 1 to n-l+1
    for s_2 = 1 to n-l+1
      if Score(s, 2, DNA) > Score(bestMotif, 2, DNA)
        bestMotif_1 := s_1
        bestMotif_2 := s_2
  s_1 := bestMotif_1; s_2 := bestMotif_2
  for i = 3 to t
    for s_i = 1 to n-l+1
      if Score(s, i, DNA) > Score(bestMotif, i, DNA)
        bestMotif_i := s_i
    s_i := bestMotif_i
  return bestMotif
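The greedy search can be sketched as follows, with 0-based start positions and uppercase, equal-length sequences assumed. The function names and toy input are illustrative, and the result is a heuristic answer that need not be optimal in general.

```python
def partial_score(dna, s, i, l):
    """Consensus score of the alignment formed by the first i rows."""
    total = 0
    for col in range(l):
        column = [dna[r][s[r] + col] for r in range(i)]
        total += max(column.count(base) for base in "ACGT")
    return total

def greedy_motif_search(dna, l):
    t, n = len(dna), len(dna[0])
    positions = range(n - l + 1)
    best = [0] * t
    # Seed: the best-scoring pair of l-mers from the first two sequences.
    for s1 in positions:
        for s2 in positions:
            if partial_score(dna, [s1, s2], 2, l) > partial_score(dna, best, 2, l):
                best[0], best[1] = s1, s2
    # Extend one sequence at a time, keeping earlier choices fixed.
    for i in range(2, t):
        for si in positions:
            candidate = best[:i] + [si]
            if partial_score(dna, candidate, i + 1, l) > partial_score(dna, best, i + 1, l):
                best[i] = si
    return tuple(best)

toy = ["TTACGTAA", "CACGTCCC", "GGGGACGT"]
print(greedy_motif_search(toy, 4))
```

The seed step examines (n-l+1)^2 pairs, while each later sequence costs only n-l+1 evaluations, which matches the remark that most of the time goes into locating the first two l-mers.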

A digression
Score of a profile matrix looks only at the “majority” base in each column, not at the entire distribution
The issue of non-uniform “background” frequencies of bases in the genome
A better “score” of a profile matrix?

Information Content
First convert a “profile matrix” to a “position weight matrix” or PWM
  - Convert frequencies to probabilities
PWM W: W_βk = frequency of base β at position k
q_β = frequency of base β by chance
Information content of W: I(W) = ∑_k ∑_β W_βk log(W_βk / q_β)

Information Content
If W_βk is always equal to q_β, i.e., if W is similar to random sequence, information content of W is 0.
If W is different from q, information content is high.
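A sketch of information content computed as the relative entropy of each PWM column against the background, summed over columns: the sum over positions k and bases β of W_βk log(W_βk / q_β). The base-2 logarithm (bits), the data layout, and the convention that zero-probability terms contribute nothing are assumptions made for this example.

```python
import math

def information_content(pwm, background):
    """Relative entropy (in bits) of PWM columns against background frequencies.

    pwm: list of columns, each a dict mapping base -> probability.
    background: dict mapping base -> chance frequency q.
    """
    total = 0.0
    for column in pwm:
        for base, p in column.items():
            if p > 0:  # 0 * log(0) is treated as 0
                total += p * math.log2(p / background[base])
    return total

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}

print(information_content([uniform, uniform], uniform))      # column equals background
print(information_content([conserved, conserved], uniform))  # perfectly conserved columns
```

With a uniform background, a column that matches the background contributes 0 and a perfectly conserved column contributes 2 bits, in line with the two cases on the slide.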

Greedy Motif Search
Can be trivially modified to use “Information Content” as the score
Use statistical criteria to evaluate significance of Information Content
At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs
  - “Beam search”
The program “CONSENSUS” from Stormo lab
Further Reading: Hertz, Hartzell & Stormo, CABIOS (1990)

More on Greedy algorithms in next lecture