Finding Regulatory Motifs in DNA Sequences

Slides:



Advertisements
Similar presentations
Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants Yufeng Wu and Dan Gusfield UC Davis CPM 2007.
Advertisements

Longest Common Subsequence
Traveling Salesperson Problem
Greedy Algorithms CS 466 Saurabh Sinha. A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix of.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Greedy Algorithms CS 6030 by Savitha Parur Venkitachalam.
Random Projection Approach to Motif Finding Adapted from RandomProjections.ppt.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Parsimony based phylogenetic trees Sushmita Roy BMI/CS 576 Sep 30 th, 2014.
Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.
Phylogenetic Trees Lecture 4
Finding Subtle Motifs by Branching from Sample Strings Xuan Qi Computer Science Dept. Utah State Univ.
3 -1 Chapter 3 The Greedy Method 3 -2 The greedy method Suppose that a problem can be solved by a sequence of decisions. The greedy method has that each.
MAE 552 – Heuristic Optimization Lecture 27 April 3, 2002
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Identification of Distinguishing Motifs Zhanyong WANG (Master Degree Student) Dept. of Computer Science, City University of Hong Kong
MAE 552 – Heuristic Optimization Lecture 26 April 1, 2002 Topic:Branch and Bound.
Homework page 102 questions 1, 4, and 10 page 106 questions 4 and 5 page 111 question 1 page 119 question 9.
Ch 13 – Backtracking + Branch-and-Bound
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
ASC Program Example Part 3 of Associative Computing Examining the MST code in ASC Primer.
Finding Regulatory Motifs in DNA Sequences
Backtracking.
Motif Refinement using Hybrid Expectation Maximization Algorithm Chandan Reddy Yao-Chung Weng Hsiao-Dong Chiang School of Electrical and Computer Engr.
Finding Regulatory Motifs in DNA Sequences. Motifs and Transcriptional Start Sites gene ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC.
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.
Motif Finding Kun-Pin Wu ( 巫坤品 ) Institute of Biomedical Informatics National Yang Ming University 2007/11/20.
© The McGraw-Hill Companies, Inc., Chapter 3 The Greedy Method.
Finding Regulatory Motifs in DNA Sequences An Introduction to Bioinformatics Algorithms (Jones and Pevzner)
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
BackTracking CS335. N-Queens The object is to place queens on a chess board in such as way as no queen can capture another one in a single move –Recall.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Outline More exhaustive search algorithms Today: Motif finding
Greedy Algorithms CS 498 SS Saurabh Sinha. A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix.
Greedy Methods and Backtracking Dr. Marina Gavrilova Computer Science University of Calgary Canada.
5-1-1 CSC401 – Analysis of Algorithms Chapter 5--1 The Greedy Method Objectives Introduce the Brute Force method and the Greedy Method Compare the solutions.
Lecture 5 Motif discovery. Signals in DNA Genes Promoter regions Binding sites for regulatory proteins (transcription factors, enhancer modules, motifs)
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2015.
Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string.
Introduction to Bioinformatics Algorithms Randomized Algorithms and Motif Finding.
Motif Finding [1]: Ch , , 5.5,
Randomized Algorithms Chapter 12 Jason Eric Johnson Presentation #3 CS Bioinformatics.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Efficient.
CSCE350 Algorithms and Data Structure Lecture 21 Jianjun Hu Department of Computer Science and Engineering University of South Carolina
Introduction to Bioinformatics Algorithms Finding Regulatory Motifs in DNA Sequences.
HW4: sites that look like transcription start sites Nucleotide histogram Background frequency Count matrix for translation start sites (-10 to 10) Frequency.
Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
Randomized Algorithms for Motif Finding [1] Ch 12.2.
Optimization Problems The previous examples simply find a single solution What if we have a cost associated with each solution and we want to find the.
Chapter 3 Brute Force Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
BackTracking CS255.
CSE 5290: Algorithms for Bioinformatics Fall 2011
Probabilities and Probabilistic Models
Finding Regulatory Motifs in DNA Sequences
BNFO 602 Phylogenetics Usman Roshan.
3. Brute Force Selection sort Brute-Force string matching
CSE 5290: Algorithms for Bioinformatics Fall 2009
3. Brute Force Selection sort Brute-Force string matching
Backtracking and Branch-and-Bound
Algorithm Course Algorithms Lecture 3 Sorting Algorithm-1
3. Brute Force Selection sort Brute-Force string matching
Presentation transcript:

Finding Regulatory Motifs in DNA Sequences Lecture 10 – Branch and Bound

Revisiting Brute Force Search Now that we have method for navigating the tree, lets look again at BruteForceMotifSearch

Brute Force Search Again BruteForceMotifSearchAgain(DNA, t, n, l) s  (1,1,…, 1) bestMotif  (s1,s2 , . . . , st) bestScore  Score(s,DNA) while forever s  NextLeaf (s, t, n- l +1) if (Score(s,DNA) > bestScore) bestScore  Score(s, DNA) return bestMotif

Can We Do Better? Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si) Every row of alignment may add at most l to Score Optimism: if all subsequent (t-i) positions (si+1, …st) add (t – i ) * l to Score(s,i,DNA) If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree Use ByPass()

Branch and Bound Algorithm for Motif Search Since each level of the tree goes deeper into search, discarding a prefix discards all following branches This saves us from looking at (n – l + 1)t-i leaves Use NextVertex() and ByPass() to navigate the tree

Pseudocode for Branch and Bound Motif Search BranchAndBoundMotifSearch(DNA,t,n,l) s  (1,…,1) bestScore  0 i  1 while i > 0 if i < t optimisticScore  Score(s, i, DNA) +(t – i ) * l if optimisticScore < bestScore (s, i)  Bypass(s,i, n-l +1) else (s, i)  NextVertex(s, i, n-l +1) if Score(s,DNA) > bestScore bestScore  Score(s) bestMotif  (s1, s2, s3, …, st) (s,i)  NextVertex(s,i,t,n-l + 1) return bestMotif

Median String Search Improvements Recall the computational differences between motif search and median string search The Motif Finding Problem needs to examine all (n-l +1)t combinations for s. The Median String Problem needs to examine 4l combinations of v. This number is relatively small We want to use median string algorithm with the Branch and Bound trick!

Branch and Bound Applied to Median String Search Note that if the total distance for a prefix is greater than that for the best word so far: TotalDistance (prefix, DNA) > BestDistance there is no use exploring the remaining part of the word We can eliminate that branch and BYPASS exploring that branch further

Bounded Median String Search BranchAndBoundMedianStringSearch(DNA,t,n,l ) s  (1,…,1) bestDistance  ∞ i  1 while i > 0 if i < l prefix  string corresponding to the first i nucleotides of s optimisticDistance  TotalDistance(prefix,DNA) if optimisticDistance > bestDistance (s, i )  Bypass(s,i, l, 4) else (s, i )  NextVertex(s, i, l, 4) word  nucleotide string corresponding to s if TotalDistance(s,DNA) < bestDistance bestDistance  TotalDistance(word, DNA) bestWord  word (s,i )  NextVertex(s,i,l, 4) return bestWord

Improving the Bounds Given an l-mer w, divided into two parts at point i u : prefix w1, …, wi, v : suffix wi+1, ..., wl Find minimum distance for u in a sequence No instances of u in the sequence have distance less than the minimum distance Note this doesn’t tell us anything about whether u is part of any motif. We only get a minimum distance for prefix u

Improving the Bounds (cont’d) Repeating the process for the suffix v gives us a minimum distance for v Since u and v are two substrings of w, and included in motif w, we can assume that the minimum distance of u plus minimum distance of v can only be less than the minimum distance for w

Better Bounds

Better Bounds (cont’d) If d(prefix) + d(suffix) > bestDistance: Motif w (prefix.suffix) cannot give a better (lower) score than d(prefix) + d(suffix) In this case, we can ByPass()

Better Bounded Median String Search ImprovedBranchAndBoundMedianString(DNA,t,n,l) s = (1, 1, …, 1) bestdistance = ∞ i = 1 while i > 0 if i < l prefix = nucleotide string corresponding to (s1, s2, s3, …, si ) optimisticPrefixDistance = TotalDistance (prefix, DNA) if (optimisticPrefixDistance < bestsubstring[ i ]) bestsubstring[ i ] = optimisticPrefixDistance if (l - i < i ) optimisticSufxDistance = bestsubstring[l -i ] else optimisticSufxDistance = 0; if optimisticPrefixDistance + optimisticSufxDistance > bestDistance (s, i ) = Bypass(s, i, l, 4) (s, i ) = NextVertex(s, i, l,4) word = nucleotide string corresponding to (s1,s2, s3, …, st) if TotalDistance( word, DNA) < bestDistance bestDistance = TotalDistance(word, DNA) bestWord = word (s,i) = NextVertex(s, i,l, 4) return bestWord

More on the Motif Problem Exhaustive Search and Median String are both exact algorithms They always find the optimal solution, though they may be too slow to perform practical tasks Many algorithms sacrifice optimal solution for speed

CONSENSUS: Greedy Motif Search Find two closest l-mers in sequences 1 and 2 and forms 2 x l alignment matrix with Score(s,2,DNA) At each of the following t-2 iterations CONSENSUS finds a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA) under the assumption that the first (i-1) l-mers have been already chosen CONSENSUS sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

Some Motif Finding Programs CONSENSUS Hertz, Stromo (1989) GibbsDNA Lawrence et al (1993) MEME Bailey, Elkan (1995) RandomProjections Buhler, Tompa (2002) MULTIPROFILER Keich, Pevzner (2002) MITRA Eskin, Pevzner (2002) Pattern Branching Price, Pevzner (2003)

Planted Motif Challenge Input: n sequences of length m each. Output: Motif M, of length l Variants of interest have a hamming distance of d from M

How to proceed? Exhaustive search? Run time is high

How to search motif space? Start from random sample strings Search motif space for the star

Search small neighborhoods

Exhaustive local search A lot of work, most of it unecessary

Best Neighbor Branch from the seed strings Find best neighbor - highest score Don’t consider branches where the upper bound is not as good as best score so far

Scoring PatternBranching use total distance score: For each sequence Si in the sample S = {S1, . . . , Sn}, let d(A, Si) = min{d(A, P) | P  Si}. Then the total distance of A from the sample is d(A, S) = ∑ Si  S d(A, Si). For a pattern A, let D=Neighbor(A) be the set of patterns which differ from A in exactly 1 position. We define BestNeighbor(A) as the pattern B  D=Neighbor(A) with lowest total distance d(B, S).

PatternBranching Algorithm

PatternBranching Performance PatternBranching is faster than other pattern-based algorithms Motif Challenge Problem: sample of n = 20 sequences N = 600 nucleotides long implanted pattern of length l = 15 k = 4 mutations

PMS (Planted Motif Search) Generate all possible l-mers from out of the input sequence Si. Let Ci be the collection of these l-mers. Example: AAGTCAGGAGT Ci = 3-mers: AAG AGT GTC TCA CAG AGG GGA GAG AGT

All patterns at Hamming distance d = 1 AAG AGT GTC TCA CAG AGG GGA GAG AGT CAG CGT ATC ACA AAG CGG AGA AAG CGT GAG GGT CTC CCA GAG TGG CGA CAG GGT TAG TGT TTC GCA TAG GGG TGA TAG TGT ACG ACT GAC TAA CCG ACG GAA GCG ACT AGG ATT GCC TGA CGG ATG GCA GGG ATT ATG AAT GGC TTA CTG AAG GTA GTG AAT AAC AGA GTA TCC CAA AGA GGC GAA AGA AAA AGC GTG TCG CAC AGT GGG GAC AGC AAT AGG GTT TCT CAT AGC GGT GAT AGG

Sort the lists AAG AGT GTC TCA CAG AGG GGA GAG AGT AAA AAT ATC ACA AAG AAG AGA AAG AAT AAC ACT CTC CCA CAA ACG CGA CAG ACT AAT AGA GAC GCA CAC AGA GAA GAA AGA ACG AGC GCC TAA CAT AGC GCA GAC AGC AGG AGG GGC TCC CCG AGT GGC GAT AGG ATG ATT GTA TCG CGG ATG GGG GCG ATT CAG CGT GTG TCT CTG CGG GGT GGG CGT GAG GGT GTT TGA GAG GGG GTA GTG GGT TAG TGT TTC TTA TAG TGG TGA TAG TGT

Eliminate duplicates AAG AGT GTC TCA CAG AGG GGA GAG AGT AAA AAT ATC ACA AAG AAG AGA AAG AAT AAC ACT CTC CCA CAA ACG CGA CAG ACT AAT AGA GAC GCA CAC AGA GAA GAA AGA ACG AGC GCC TAA CAT AGC GCA GAC AGC AGG AGG GGC TCC CCG AGT GGC GAT AGG ATG ATT GTA TCG CGG ATG GGG GCG ATT CAG CGT GTG TCT CTG CGG GGT GGG CGT GAG GGT GTT TGA GAG GGG GTA GTG GGT TAG TGT TTC TTA TAG TGG TGA TAG TGT

Find motif common to all lists Follow this procedure for all sequences Find the motif common all Li (once duplicates have been eliminated) This is the planted motif

PMS Running Time It takes time to Running time of this algorithm: Generate variants Sort lists Find and eliminate duplicates Running time of this algorithm: w is the word length of the computer