(Regulatory-) Motif Finding


Finding Regulatory Motifs
Given a collection of genes with common expression, find the TF-binding motif they have in common.

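Expectation Maximization
- Initialize parameters θ = (M, B), λ:
  - Try different values of λ from $N^{-1/2}$ up to $1/(2K)$
- Repeat:
  - Expectation
  - Maximization
- Until the change in θ = (M, B), λ falls below ε
- Report results for several "good" λ
[Figure: two-component mixture model. With probability λ a word is emitted by the motif columns M1 … MK, and with probability 1 – λ by the background B over {A, C, G, T}]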
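The loop on this slide can be sketched in Python as a two-component mixture over all K-long words. This is a minimal sketch under stated assumptions, not the MEME implementation: the background B is held fixed, the motif is seeded from a single word, and the names `profile_from_word`, `em_step`, `run_em` and the `pseudo` smoothing constant are all hypothetical.

```python
import math

ALPHABET = "ACGT"

def profile_from_word(word, weight=0.7):
    """Hypothetical initializer: put `weight` on each letter of a seed word."""
    rest = (1.0 - weight) / 3
    return [{c: (weight if c == w else rest) for c in ALPHABET} for w in word]

def em_step(seqs, M, B, lam, pseudo=0.1):
    """One E-step plus M-step; B is kept fixed (a simplification)."""
    K = len(M)
    counts = [{c: pseudo for c in ALPHABET} for _ in range(K)]
    total_z, n_words = 0.0, 0
    for s in seqs:
        for j in range(len(s) - K + 1):
            word = s[j:j + K]
            p_motif = lam * math.prod(M[k][word[k]] for k in range(K))
            p_back = (1 - lam) * math.prod(B[c] for c in word)
            z = p_motif / (p_motif + p_back)  # E-step: P(word came from motif)
            for k in range(K):
                counts[k][word[k]] += z       # expected letter counts
            total_z += z
            n_words += 1
    # M-step: renormalize expected counts; lambda is the mean responsibility.
    new_M = [{c: col[c] / (total_z + 4 * pseudo) for c in ALPHABET} for col in counts]
    return new_M, total_z / n_words

def run_em(seqs, seed_word, B, lam, eps=1e-4, max_iter=200):
    """Repeat EM steps until the largest parameter change falls below eps."""
    M = profile_from_word(seed_word)
    for _ in range(max_iter):
        new_M, new_lam = em_step(seqs, M, B, lam)
        change = max(abs(new_lam - lam),
                     max(abs(new_M[k][c] - M[k][c])
                         for k in range(len(M)) for c in ALPHABET))
        M, lam = new_M, new_lam
        if change < eps:
            break
    return M, lam
```

Trying several starting values of λ, as the slide suggests, amounts to calling `run_em` once per value and comparing the resulting models.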

Gibbs Sampling in Motif Finding

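Gibbs Sampling
Given:
- x1, …, xN,
- motif length K,
- background B,
Find:
- Model M
- Locations a1, …, aN in x1, …, xN
maximizing the log-odds likelihood ratio:
$$\sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{M(k,\, x^i_{a_i+k-1})}{B(x^i_{a_i+k-1})}$$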

Gibbs Sampling
- AlignACE: first statistical motif finder
- BioProspector: improved version of AlignACE
Algorithm (sketch):
1. Initialization:
   a. Select random locations in sequences x1, …, xN
   b. Compute an initial model M from these locations
2. Sampling iterations:
   a. Remove one sequence xi
   b. Recalculate the model
   c. Pick a new location of the motif in xi according to the probability that the location is a motif occurrence

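Gibbs Sampling
Initialization:
- Select random locations a1, …, aN in x1, …, xN
- For these locations, compute M:
  $$M_{kj} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[x^i_{a_i+k-1} = j\right]$$
- That is, Mkj is the number of occurrences of letter j in motif position k, over the total number of sequences N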
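The initialization step can be sketched as follows. The function names are hypothetical, positions are 0-based in the code (the slide counts from 1), and sequences are assumed to contain only A, C, G, T.

```python
import random

ALPHABET = "ACGT"

def random_starts(seqs, K, rng):
    """Pick one random motif start per sequence (0-based)."""
    return [rng.randrange(len(s) - K + 1) for s in seqs]

def initial_profile(seqs, starts, K):
    """M[k][j] = fraction of sequences with letter j at motif position k."""
    N = len(seqs)
    M = [{c: 0.0 for c in ALPHABET} for _ in range(K)]
    for s, a in zip(seqs, starts):
        for k in range(K):
            M[k][s[a + k]] += 1.0 / N
    return M
```

Without pseudocounts this profile can contain zeros, which is why the predictive update that follows adds them.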

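Gibbs Sampling
Predictive Update:
- Select a sequence x = xi
- Remove xi, recompute the model:
  $$M_{kj} = \frac{1}{(N-1) + B} \left( \beta_j + \sum_{s \ne i} \mathbf{1}\left[x^s_{a_s+k-1} = j\right] \right)$$
  where βj are pseudocounts to avoid 0s, and B = Σj βj (here B is the total pseudocount, not the background model)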
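A minimal sketch of the predictive update, assuming 0-based start positions and a dictionary `beta` of per-letter pseudocounts; the function name and argument layout are illustrative only.

```python
def predictive_update(seqs, starts, K, held_out, beta):
    """Recompute the profile from every sequence except `held_out`."""
    B_total = sum(beta.values())          # B = sum of the pseudocounts
    M = [dict(beta) for _ in range(K)]    # every column starts at beta_j
    for s_idx, (s, a) in enumerate(zip(seqs, starts)):
        if s_idx == held_out:
            continue
        for k in range(K):
            M[k][s[a + k]] += 1
    denom = (len(seqs) - 1) + B_total
    return [{c: v / denom for c, v in col.items()} for col in M]
```

Each column sums to 1 by construction, and no entry is zero as long as every βj is positive.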

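Gibbs Sampling
Sampling:
For every K-long word xj, …, xj+K-1 in x:
- Qj = Prob[ word | motif ] = M(1, xj) × … × M(K, xj+K-1)
- Pj = Prob[ word | background ] = B(xj) × … × B(xj+K-1)
Let
$$A_j = \frac{Q_j / P_j}{\sum_{j'=1}^{|x|-K+1} Q_{j'} / P_{j'}}$$
Sample a random new position ai according to the probabilities A1, …, A|x|-K+1.
[Figure: probability Aj plotted against position j in x, from 0 to |x|]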
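The sampling step can be sketched like this, assuming M is a list of per-position letter distributions with no zero entries and B is a 0th-order background; the function names are hypothetical.

```python
import math
import random

def position_probabilities(x, M, B):
    """Return A_j for every K-long word of x (0-based j)."""
    K = len(M)
    ratios = []
    for j in range(len(x) - K + 1):
        Q = math.prod(M[k][x[j + k]] for k in range(K))   # word | motif
        P = math.prod(B[x[j + k]] for k in range(K))      # word | background
        ratios.append(Q / P)
    total = sum(ratios)
    return [r / total for r in ratios]

def sample_position(x, M, B, rng):
    """Draw a new motif start in x according to A_1, ..., A_{|x|-K+1}."""
    A = position_probabilities(x, M, B)
    return rng.choices(range(len(A)), weights=A, k=1)[0]
```

Because the position is sampled rather than set to the arg-max, the procedure can leave a poor local optimum, which is the main difference from a greedy update.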

Gibbs Sampling
Running Gibbs Sampling:
1. Initialize
2. Run until convergence
3. Repeat 1, 2 several times; report common motifs
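Putting the three steps together gives the following sketch. It makes several simplifying assumptions that are not on the slides: a fixed iteration count stands in for "run until convergence", uniform pseudocounts are used, and instead of reporting motifs common to several runs it keeps the restart with the highest log-odds score. All function names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

def profile(seqs, starts, K, pseudo, skip=None):
    """Profile with pseudocounts, optionally leaving one sequence out."""
    counts = [{c: pseudo for c in ALPHABET} for _ in range(K)]
    n = 0
    for idx, (s, a) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        n += 1
        for k in range(K):
            counts[k][s[a + k]] += 1
    denom = n + 4 * pseudo
    return [{c: v / denom for c, v in col.items()} for col in counts]

def gibbs_run(seqs, K, B, rng, iterations=300, pseudo=0.5):
    """One run: random initialization, then repeated resampling."""
    starts = [rng.randrange(len(s) - K + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))                  # sequence to remove
        M = profile(seqs, starts, K, pseudo, skip=i)  # predictive update
        x = seqs[i]
        weights = [math.prod(M[k][x[j + k]] / B[x[j + k]] for k in range(K))
                   for j in range(len(x) - K + 1)]
        starts[i] = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return starts

def log_odds(seqs, starts, K, B, pseudo=0.5):
    """Log-odds score of the motif instances against the background."""
    M = profile(seqs, starts, K, pseudo)
    return sum(math.log(M[k][s[a + k]] / B[s[a + k]])
               for s, a in zip(seqs, starts) for k in range(K))

def best_of_restarts(seqs, K, B, restarts=5, seed=0):
    """Repeat the run several times and keep the best-scoring locations."""
    rng = random.Random(seed)
    runs = [gibbs_run(seqs, K, B, rng) for _ in range(restarts)]
    best = max(runs, key=lambda st: log_odds(seqs, st, K, B))
    return best, log_odds(seqs, best, K, B)
```

A typical call would be `best_of_restarts(seqs, 6, {c: 0.25 for c in "ACGT"})`, which returns the chosen start positions and their score.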

Advantages / Disadvantages
Very similar to EM
Advantages:
- Easier to implement
- Less dependent on initial parameters
- More versatile, easier to enhance with heuristics
Disadvantages:
- More dependent on all sequences to exhibit the motif
- Less systematic search of initial parameter space

Repeats, and a Better Background Model
- Repeat DNA can be mistaken for a motif
  - Especially low-complexity CACACA…, AAAAA, etc.
- Solution: more elaborate background model
  - 0th order: B = { pA, pC, pG, pT }
  - 1st order: B = { P(A|A), P(A|C), …, P(T|T) }
  - …
  - Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }
- Has been applied to EM and Gibbs (up to 3rd order)
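A higher-order background of this kind can be estimated by counting each letter after its preceding context. The sketch below is illustrative: the add-one pseudocount, the uniform fallback for unseen contexts, and the function names are assumptions rather than anything specified on the slide.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_background(seqs, order, pseudo=1.0):
    """Estimate P(X | b1...b_order) from counts; order 0 uses the empty context."""
    counts = defaultdict(lambda: {c: pseudo for c in ALPHABET})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for ctx, row in counts.items():
        total = sum(row.values())
        model[ctx] = {c: v / total for c, v in row.items()}
    return model

def background_log_prob(model, s, order):
    """Log-probability of s[order:] given its contexts; unseen contexts are uniform."""
    logp = 0.0
    for i in range(order, len(s)):
        row = model.get(s[i - order:i])
        logp += math.log(row[s[i]] if row else 0.25)
    return logp
```

Under a 1st-order model trained on CA-rich sequence, a CACACA repeat is no longer surprising relative to the background, so it earns a much lower log-odds score as a candidate motif.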

Phylogenetic Footprinting (Slides by Martin Tompa)

Phylogenetic Footprinting (Tagle et al. 1988)
- Functional sequences evolve more slowly than nonfunctional ones
- Consider a set of orthologous sequences from different species
- Identify unusually well conserved regions

Substring Parsimony Problem
Given:
- phylogenetic tree T,
- set of orthologous sequences at leaves of T,
- length k of motif,
- threshold d
Problem: Find each set S of k-mers, one k-mer from each leaf, such that the "parsimony" score of S in T is at most d.
This problem is NP-hard.

Small Example
Size of motif sought: k = 4
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)

Solution
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
k-mers shown in the figure: ACGG, ACGT
Parsimony score: 1 mutation

CLUSTALW multiple sequence alignment (rbcS gene)

Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

An Exact Algorithm (generalizing Sankoff and Rousseau 1975)
Wu[s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.
[Figure: tree over the leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG. Each node u carries a table Wu with 4^k entries, for example "ACGG: 1, ACGT: 0"; some entries are +∞]

Running Time
$$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$$
- $O(k\,4^{2k})$ time per node
- Total time $O(n\,k\,(4^{2k} + l))$, where n is the number of species, l the average sequence length, and k the motif length
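The recurrence can be run directly on the small example from the earlier slide. The sketch below assumes d(s, t) is Hamming distance, that a leaf table is 0 for k-mers occurring in the leaf sequence and +∞ otherwise, and that the tree is (((Human, Chimp), Rabbit), (Mouse, Rat)); that topology is a guess, since the transcript does not preserve the figure. Trees are written as nested tuples with sequences at the leaves.

```python
from itertools import product

INF = float("inf")

def hamming(s, t):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(s, t))

def substring_parsimony(tree, k):
    """Return (best score, root labels achieving it) for a nested-tuple tree."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def table(node):
        if isinstance(node, str):  # leaf: only k-mers present in the sequence are free
            present = {node[i:i + k] for i in range(len(node) - k + 1)}
            return {s: (0 if s in present else INF) for s in kmers}
        # Keep only finite child entries; infinite ones can never be the minimum.
        children = [[(t, w) for t, w in table(c).items() if w < INF] for c in node]
        return {s: sum(min(w + hamming(s, t) for t, w in child) for child in children)
                for s in kmers}

    root = table(tree)
    best = min(root.values())
    return best, sorted(s for s, v in root.items() if v == best)

human, chimp = "AGTCGTACGTGAC", "AGTAGACGTGCCG"
rabbit, mouse, rat = "ACGTGAGATACGT", "GAACGGAGTACGT", "TCGTGACGGTGAT"
assumed_tree = (((human, chimp), rabbit), (mouse, rat))  # hypothetical topology
score, labels = substring_parsimony(assumed_tree, 4)
```

With this tree the best score is 1, in agreement with the "Parsimony score: 1 mutation" reported on the Solution slide. Each internal node costs on the order of 4^k × 4^k distance evaluations per child, which is where the 4^(2k) factor comes from.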

Limits of Motif Finders
Given upstream regions of coregulated genes:
- Increasing length makes motif finding harder: random motifs clutter the true ones
- Decreasing length makes motif finding harder: the true motif is missing in some sequences
[Figure: upstream region of unknown extent ("???") ahead of a gene]

Limits of Motif Finders
- A (k,d)-motif is a k-long motif with d random differences per copy
- Motif Challenge problem: find a (15,4) motif in N sequences of length L
- CONSENSUS, MEME, AlignACE, & most other programs fail for N = 20, L = 1000
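An instance of the challenge problem can be generated with a short script. This sketch reads "d random differences" as exactly d substitutions at distinct positions and uses a uniform random background; both are assumptions, and `plant_motif` is a hypothetical name.

```python
import random

ALPHABET = "ACGT"

def plant_motif(N, L, k, d, seed=0):
    """Return (consensus, sequences, planted start positions)."""
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(k))
    seqs, positions = [], []
    for _ in range(N):
        copy = list(consensus)
        for p in rng.sample(range(k), d):           # d distinct positions
            copy[p] = rng.choice([c for c in ALPHABET if c != copy[p]])
        start = rng.randrange(L - k + 1)
        background = [rng.choice(ALPHABET) for _ in range(L)]
        background[start:start + k] = copy           # overwrite with the copy
        seqs.append("".join(background))
        positions.append(start)
    return consensus, seqs, positions

consensus, seqs, positions = plant_motif(N=20, L=1000, k=15, d=4)
```

Two planted copies can differ from each other in up to 2d = 8 of 15 positions, which is part of why the (15,4) instance is hard: random background 15-mers can look as similar to each other as the true copies do.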

Example Application: Motifs in Yeast
Group: Tavazoie et al. 1999, G. Church's lab, Harvard
Data:
- Microarrays on 6,220 mRNAs from yeast, Affymetrix chips (Cho et al.)
- 15 time points across two cell cycles

Processing of Data
1. Selection of 3,000 genes
   - Genes with most variable expression were selected
2. Clustering according to common expression
   - K-means clustering
   - 30 clusters, 50-190 genes/cluster
   - Clusters correlate well with known function
3. AlignACE motif finding
   - 600-long upstream regions
   - 50 regions/trial

Motifs in Periodic Clusters

Motifs in Non-periodic Clusters

Rapid Global Alignments How to align genomic sequences in (more or less) linear time

Motivation
- Genomic sequences are very long:
  - Human genome = 3 × 10^9 long
  - Mouse genome = 2.7 × 10^9 long
- Aligning genomic regions is useful for revealing common gene structure
- It is useful to compare regions > 1,000,000 long

Main Idea
Genomic regions of interest contain ordered islands of similarity, such as genes
1. Find local alignments
2. Chain an optimal subset of them
3. Refine/complete the alignment
Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others

Saving cells in DP
1. Find local alignments
2. Chain: O(N log N) with L.I.S.
3. Restricted DP

Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

The Problem: Find a Chain of Local Alignments
(x, y) → (x', y') requires
- x < x'
- y < y'
Each local alignment has a weight
FIND the chain with highest total weight

Quadratic Time Solution
Build a Directed Acyclic Graph (DAG):
- Nodes: local alignments [(xa, xb), (ya, yb)] & score
- Directed edges: local alignments that can be chained
  - edge ( (xa, xb, ya, yb), (xc, xd, yc, yd) )
  - xa < xb < xc < xd
  - ya < yb < yc < yd
Each local alignment is a node vi with alignment score si

Quadratic Time Solution
Dynamic programming:
- Initialization: find each node va s.t. there is no edge (u, va); set the score V(a) to be sa
- Iteration: for each vi, the optimal path ending in vi has total score V(i) = max over j of ( weight(vj, vi) + V(j) )
- Termination: optimal global chain: j = argmax ( V(j) ); trace the chain back from vj
Worst case time: quadratic
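A direct implementation of this dynamic program might look as follows. It is a sketch under two assumptions: each alignment is a tuple (xa, xb, ya, yb, score) with xa ≤ xb and ya ≤ yb, and weight(vj, vi) is taken to be the score of vi. The function name is hypothetical.

```python
def chain_quadratic(alignments):
    """Return (best total score, indices of the chain in order)."""
    # Any predecessor ends before vi starts, so it also starts before vi:
    # processing in order of xa guarantees predecessors are scored first.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V, back = {}, {}
    for pos, i in enumerate(order):
        xa, _, ya, _, score = alignments[i]
        V[i], back[i] = score, None          # chain consisting of vi alone
        for j in order[:pos]:
            _, xb_j, _, yb_j, _ = alignments[j]
            if xb_j < xa and yb_j < ya and V[j] + score > V[i]:
                V[i], back[i] = V[j] + score, j
    end = max(V, key=V.get)
    chain, node = [], end
    while node is not None:                  # follow back pointers
        chain.append(node)
        node = back[node]
    return V[end], chain[::-1]

example = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 10), (6, 8, 6, 9, 4)]
best_score, best_chain = chain_quadratic(example)
```

The inner loop compares every pair of alignments, which is the quadratic cost that the sparse method in the following slides avoids.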

Sparse Dynamic Programming
Back to the LCS problem:
- Given two sequences x = x1, …, xm and y = y1, …, yn
- Find the longest common subsequence
- Quadratic solution with DP
How about when "hits" xi = yj are sparse?

Sparse Dynamic Programming
[Figure: DP grid with axis labels 15 3 24 16 20 4 11 18 and 4 20 24 3 11 15 18, with the matching cells marked]
Imagine a situation where the number of hits is much smaller than O(nm), maybe O(n) instead

Sparse Dynamic Programming – L.I.S.
Longest Increasing Subsequence
- Given a sequence over an ordered alphabet: x = x1, …, xm
- Find the longest subsequence s = s1, …, sk with s1 < s2 < … < sk

Sparse LCS expressed as LIS
Create a sequence w:
- Every matching point x-to-y, (i, j), is inserted into w as follows:
- For each position j of x, from smallest to largest, insert in w the points (i, j), in decreasing order of i
- The 11 example points are inserted in the order given
- Any two points (ya, xa), (yb, xb) can be chained iff a is before b in w, and ya < yb
[Figure: grid of the 11 numbered match points between x and y]

Sparse LCS expressed as LIS
Create a sequence w:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w's elements as ordered lexicographically, where (ya, xa) < (yb, xb) if ya < yb
Claim: An increasing subsequence of w is a common subsequence of x and y
[Figure: grid of the 11 numbered match points between x and y]
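Building w from two sequences takes only a few lines. The sketch below uses 1-based positions to match the slide and a hypothetical function name `match_points`.

```python
from collections import defaultdict

def match_points(x, y):
    """Return w: points (i, j) with y[i] == x[j], 1-based, in insertion order."""
    positions_in_y = defaultdict(list)
    for i, c in enumerate(y, start=1):
        positions_in_y[c].append(i)
    w = []
    for j, c in enumerate(x, start=1):                 # j from smallest to largest
        for i in reversed(positions_in_y.get(c, [])):  # decreasing i for each j
            w.append((i, j))
    return w
```

Inserting the points of one x-position in decreasing i order matters: a strictly increasing run of first coordinates can then never use two points with the same j, so every increasing subsequence of w uses each position of x at most once.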

Sparse Dynamic Programming for LIS
Algorithm:
  initialize empty array L
  /* at each point, lj will contain the smallest last element over all j-long increasing subsequences seen so far */
  for i = 1 to |w|
      binary search for w[i] in L, to find lj < w[i] ≤ lj+1
      replace lj+1 with w[i]
      keep a backptr lj ← w[i]
That's it!!!
[Figure: grid of the 11 numbered match points between x and y]

Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L after each insertion:
  (4,2)
  (3,3)
  (3,3) (10,5)
  (2,5) (10,5)
  (2,5) (8,6)
  (1,6) (8,6)
  (1,6) (3,7)
  (1,6) (3,7) (4,8)
  (1,6) (3,7) (4,8) (7,9)
  (1,6) (3,7) (4,8) (5,9)
  (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11, 18
[Figure: grid of the 11 numbered match points between x and y]
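The algorithm and the example can be reproduced with Python's `bisect` module. In this sketch the points are compared by their first coordinate only, as in the ordering defined for w, and the function name is hypothetical.

```python
import bisect

def longest_increasing_chain(w):
    """Longest subsequence of w with strictly increasing first coordinate."""
    tails_y, tails_idx = [], []    # tails_y[p]: smallest last y of a (p+1)-long chain
    back = [None] * len(w)
    for idx, (y, _) in enumerate(w):
        p = bisect.bisect_left(tails_y, y)   # tails_y[p-1] < y <= tails_y[p]
        if p > 0:
            back[idx] = tails_idx[p - 1]     # back pointer to the previous element
        if p == len(tails_y):
            tails_y.append(y)
            tails_idx.append(idx)
        else:
            tails_y[p] = y
            tails_idx[p] = idx
    chain = []
    idx = tails_idx[-1] if tails_idx else None
    while idx is not None:
        chain.append(w[idx])
        idx = back[idx]
    return chain[::-1]

w = [(4, 2), (3, 3), (10, 5), (2, 5), (8, 6), (1, 6),
     (3, 7), (4, 8), (7, 9), (5, 9), (9, 10)]
print(longest_increasing_chain(w))
# [(1, 6), (3, 7), (4, 8), (5, 9), (9, 10)]
```

The printed chain is the final state of L in the example. Each of the |w| elements costs one binary search, so the total time is O(|w| log |w|).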

Sparse DP for rectangle chaining
- 1, …, N: rectangles
- (hj, lj): y-coordinates of rectangle j
- w(j): weight of rectangle j
- V(j): optimal score of chain ending in j
- L: list of triplets (lj, V(j), j)
  - L is sorted by lj
  - L is implemented as a balanced binary tree
[Figure: rectangle spanning y-coordinates h to l]

Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
- When on the leftmost end of i:
  - j: rectangle in L, with largest lj < hi
  - V(i) = w(i) + V(j)
- When on the rightmost end of i:
  - j: rectangle in L, with largest lj ≤ li
  - If V(i) > V(j):
    - INSERT (li, V(i), i) in L
    - REMOVE all (lk, V(k), k) with V(k) ≤ V(i) & lk ≥ li
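The sweep can be sketched as below. Several choices are assumptions rather than part of the slides: rectangles are tuples (x_left, x_right, h, l, weight) with h ≤ l and positive weight, left ends are processed before right ends at equal x so that chaining stays strict, and L is kept in sorted Python lists searched with `bisect`. List insertion and deletion are linear-time, so a balanced tree would be needed to actually reach the O(N log N) bound.

```python
import bisect

def chain_rectangles(rects):
    """Return (best score, chain of rectangle indices) via a left-to-right sweep."""
    events = []
    for i, (x_left, x_right, _, _, _) in enumerate(rects):
        events.append((x_left, 0, i))    # 0: left end, sorts before right ends
        events.append((x_right, 1, i))   # 1: right end
    events.sort()
    ls, scores, ids = [], [], []         # L: sorted by l, scores strictly increasing
    V = [0] * len(rects)
    back = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect.bisect_left(ls, h) - 1        # largest l_j < h_i
            V[i] = weight + (scores[p] if p >= 0 else 0)
            back[i] = ids[p] if p >= 0 else None
        else:
            p = bisect.bisect_right(ls, l) - 1       # largest l_j <= l_i
            if p >= 0 and scores[p] >= V[i]:
                continue                             # i is dominated, skip it
            q = bisect.bisect_left(ls, l)
            r = q
            while r < len(ls) and scores[r] <= V[i]: # consecutive dominated entries
                r += 1
            ls[q:r], scores[q:r], ids[q:r] = [l], [V[i]], [i]
    end = max(range(len(rects)), key=lambda i: V[i])
    chain, node = [], end
    while node is not None:
        chain.append(node)
        node = back[node]
    return V[end], chain[::-1]

example = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 10), (6, 8, 6, 9, 4)]
best_score, best_chain = chain_rectangles(example)
```

On this four-rectangle input the sweep returns a score of 15 for the chain 0, 1, 3, the same answer a quadratic all-pairs dynamic program gives.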

Example
[Figure: five rectangles on an x-y grid, labeled with their weights (1: 5, 2: 6, 3: 3, 4: 4, 5: 2)]

Time Analysis
- Sorting the x-coords takes O(N log N)
- Going through x-coords: N steps
- Each of N steps requires O(log N) time:
  - Searching L takes log N
  - Inserting to L takes log N
  - All deletions are consecutive, so log N per deletion
  - Each element is deleted at most once: N log N for all deletions
- Recall that INSERT, DELETE, SUCCESSOR take O(log N) time in a balanced binary search tree